AI FOR SECURITY - PROJECT 2024/2025¶
- Mattia Campana - 10712872
- Camilla Foresti - 10739259
- Filippo Gotta - 10774461
INTRODUCTION¶
The Internet of Medical Things (IoMT) has become increasingly vital in healthcare, enabling continuous patient monitoring and automated medical services. However, this connectivity also introduces cybersecurity risks that could compromise patient care and privacy. The CICIoMT2024 dataset, developed by the Canadian Institute for Cybersecurity (CIC), provides a comprehensive benchmark for evaluating IoMT security solutions.
Dataset Overview¶
- Contains network traffic from 40 IoMT devices (25 real and 15 simulated)
- Includes traffic across multiple protocols: Wi-Fi, MQTT, and Bluetooth
- Features 18 different types of attacks categorized into:
- DDoS (Distributed Denial of Service)
- DoS (Denial of Service)
- Reconnaissance
- MQTT-specific attacks
- Spoofing
Project Objectives¶
- Analyze a subset of the CICIoMT2024 dataset to develop efficient security models
- Implement and evaluate both supervised and unsupervised learning approaches
- Compare model performance across different attack categories
- Identify the most effective model for real-time threat detection in IoMT networks
Our analysis will focus on the trade-off between model complexity and detection accuracy, aiming to provide practical insights for securing healthcare IoT infrastructure.
LIBRARIES¶
Data Manipulation, Preprocessing and Analysis¶
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
Data Visualisation¶
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from sklearn.manifold import TSNE
from tabulate import tabulate
from tqdm import tqdm
from rich import print
from rich.tree import Tree
from rich.panel import Panel
Text Processing¶
import re
Machine Learning¶
Models¶
# Supervised
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
# Neural Network (supervised)
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
# Unsupervised
## Clustering (unsupervised)
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.cluster import KMeans, DBSCAN
## Anomaly Detection (unsupervised)
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
Optimisation¶
from skopt import BayesSearchCV
from sklearn.model_selection import GridSearchCV, StratifiedKFold
Evaluation¶
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score, silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.metrics import make_scorer, roc_curve, roc_auc_score
from sklearn.calibration import CalibrationDisplay
System/File Operations¶
import os
from pathlib import Path
import shutil
from typing import Tuple, Dict, Literal, Optional, List
from collections import defaultdict
import time
from joblib import dump, load
import json
SEEDING¶
To make our code reproducible, we set a random seed every time a randomised operation is performed. We picked the number 37 each time for no particular reason. Wait... maybe there's a reason...
https://youtu.be/d6iQrh2TK98?si=174Z_qsxlW1cdiDd - "Why is this number everywhere?", Veritasium
http://thirty-seven.org - 37 Heaven
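As an aside, this seeding discipline can be captured in one helper (a sketch, not part of the notebook; the torch branch is optional and only runs if PyTorch is installed):

```python
import random
import numpy as np

SEED = 37

def seed_everything(seed: int = SEED) -> None:
    """Seed every source of randomness used in this notebook."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass

# Re-seeding before each randomised operation makes draws repeatable
seed_everything()
a = np.random.rand(3)
seed_everything()
b = np.random.rand(3)
assert np.allclose(a, b)
```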
1. DATASET PREPARATION¶
Download Dataset¶
Create the directory and download the whole CSV dataset through wget
mkdir dataset
cd dataset/ && wget -c -r -np -nH --cut-dirs=3 -R "index.html*" http://205.174.165.80/IOTDataset/CICIoMT2024/Dataset/WiFI_and_MQTT/attacks/CSV/
Analyse Dataset¶
The dataset consists of network traffic files organized by protocol and specific attack type. Each row summarizes a window of 10 to 100 traffic packets, so many features describe not a single event but statistics computed over the whole window.
Due to the diverse distribution of data across files, the first step consisted in analyzing the volume of samples. This initial assessment was crucial as significant imbalances in data volume between attack types or protocols could skew our analysis.
Understanding distributions allowed us to create a balanced and representative subset of data for our analysis while ensuring computational efficiency.
In this phase, we explore the directory and provide a detailed analysis of its structure and content. Note that files are named after the protocol and attack type; when multiple files cover the same protocol/attack pair, they are numbered starting from zero. To avoid imbalance in data sampling, the following code groups files by their unique base names, yielding 19 groups: 18 attacks and 1 benign.
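For example, stripping digit runs with `re.sub` collapses numbered files onto their shared base name (the file names here are illustrative, not taken from the dataset):

```python
import re

# Hypothetical file names mirroring the dataset's naming scheme
files = ['TCP_IP-DDoS-UDP1_train.csv', 'TCP_IP-DDoS-UDP2_train.csv', 'Benign_train.csv']

# Removing all digit runs maps both numbered files to one base name
base_names = sorted({re.sub(r'\d+', '', f) for f in files})
print(base_names)  # ['Benign_train.csv', 'TCP_IP-DDoS-UDP_train.csv']
```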
def get_size(start_path: str) -> int:
return sum(os.path.getsize(os.path.join(dirpath, f))
for dirpath, _, filenames in os.walk(start_path)
for f in filenames)
def explore_subdirectory(path: str, level: int = 0) -> Dict:
contents = os.listdir(path)
dirs = [d for d in contents if os.path.isdir(os.path.join(path, d))]
files = [f for f in contents if os.path.isfile(os.path.join(path, f))]
stats = {
'directory': os.path.basename(path),
'num_dirs': len(dirs),
'num_files': len(files),
'size_mb': get_size(path) / (1024 * 1024),
'unique_bases': len(set(re.sub(r'\d+', '', f) for f in files))
}
tree = Tree(f"📁 {stats['directory']}")
tree.add(f"Directories: {stats['num_dirs']}")
tree.add(f"Files: {stats['num_files']}")
tree.add(f"Size: {stats['size_mb']:.2f} MB")
tree.add(f"Unique base names: {stats['unique_bases']}")
print(Panel(tree, border_style="black"))
for subdir in dirs:
explore_subdirectory(os.path.join(path, subdir), level + 1)
return stats
_ = explore_subdirectory('dataset/original')
📁 original
├── Directories: 2
├── Files: 0
├── Size: 2171.92 MB
└── Unique base names: 0

📁 test_og
├── Directories: 0
├── Files: 21
├── Size: 393.78 MB
└── Unique base names: 19

📁 train_og
├── Directories: 0
├── Files: 51
├── Size: 1778.14 MB
└── Unique base names: 19
Guided by the statistics above, we can merge CSV files that share a base name within each directory ('train_og' for training, 'test_og' for testing). The code below identifies all CSV files in a directory and groups them by base name. For each group with more than one file, it reads the files, concatenates them row-wise into a single DataFrame, and saves the result back under the base name.
def merge_files_by_base_name(directory: str) -> None:
in_dir = os.path.join('dataset/original', directory + '_og')
out_dir = os.path.join('dataset/merged', directory)
if not os.path.exists(out_dir):
os.makedirs(out_dir)
# Get all files in the directory
files = [f for f in os.listdir(in_dir) if f.endswith('.csv')]
# Group files by base name (excluding trailing numbers)
grouped_files = defaultdict(list)
for file in files:
base_name = re.sub(r'\d+', '', file)
grouped_files[base_name].append(file)
# Merge files in each group
for base_name, file_list in grouped_files.items():
dfs = []
if len(file_list) <= 1:
src_file = os.path.join(in_dir, file_list[0])
dest_file = os.path.join(out_dir, f"{base_name}")
shutil.copy(src_file, dest_file)
continue
for file in file_list:
file_path = os.path.join(in_dir, file)
df = pd.read_csv(file_path)
dfs.append(df)
# Concatenate all dataframes in the group
merged_df = pd.concat(dfs, ignore_index=True)
# Save the merged dataframe
output_file = os.path.join(out_dir, f"{base_name}")
merged_df.to_csv(output_file, index=False)
merge_files_by_base_name('train')
merge_files_by_base_name('test')
_ = explore_subdirectory('dataset/merged')
📁 merged
├── Directories: 2
├── Files: 0
├── Size: 2171.90 MB
└── Unique base names: 0

📁 train
├── Directories: 0
├── Files: 19
├── Size: 1778.12 MB
└── Unique base names: 19

📁 test
├── Directories: 0
├── Files: 19
├── Size: 393.78 MB
└── Unique base names: 19
File Length Analysis¶
The following code analyzes and visualizes the data distribution across files. The sampling process, implemented later, is based on file size, as follows:
- Files < 20k rows, and all benign files: retained entirely
- Files 20k–100k rows: keep 20k rows + 20% of the remaining rows
- Files > 100k rows: keep 20k rows + 20% of the next 80k rows + 10% of the rest
This stratified sampling approach helps maintain dataset balance while managing computational resources. The visualization uses color coding (green/orange/red) to highlight size thresholds.
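As a sanity check, the number of rows retained under these three rules can be computed in closed form (`expected_subset_size` is a hypothetical helper sketched here for illustration, not part of the pipeline):

```python
def expected_subset_size(n_rows: int, is_benign: bool = False) -> int:
    """Expected number of rows kept under the three sampling rules."""
    if is_benign or n_rows < 20_000:
        return n_rows                      # rule 1: keep everything
    if n_rows < 100_000:
        # rule 2: 20k cap + 20% of the remainder
        return 20_000 + round(0.2 * (n_rows - 20_000))
    # rule 3: 20k cap + 20% of the next 80k + 10% of the rest
    return 20_000 + round(0.2 * 80_000) + round(0.1 * (n_rows - 100_000))

print(expected_subset_size(15_000))   # 15000 (kept entirely)
print(expected_subset_size(50_000))   # 26000 (20k + 20% of 30k)
print(expected_subset_size(200_000))  # 46000 (20k + 16k + 10% of 100k)
```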
PLOT_CONFIG = {
'thresholds': {'small': 20000, 'medium': 100000},
'colors': {'small': 'green', 'medium': 'orange', 'large': 'red'},
'base_path': 'dataset'
}
def get_color_by_length(length: int) -> str:
if length < PLOT_CONFIG['thresholds']['small']:
return PLOT_CONFIG['colors']['small']
elif length < PLOT_CONFIG['thresholds']['medium']:
return PLOT_CONFIG['colors']['medium']
return PLOT_CONFIG['colors']['large']
def get_directory_path(name: str, directory_type: Literal['original', 'subset']) -> str:
return os.path.join(PLOT_CONFIG['base_path'], directory_type, f"{name}")
def get_file_lengths(data_dir: str) -> Dict[str, int]:
return {
file: len(pd.read_csv(os.path.join(data_dir, file)))
for file in os.listdir(data_dir)
if file.endswith('.csv') and not re.search(r'\d', file)
}
def plot_len(
name: str,
directory_type: Literal['merged', 'subset'] = 'merged',
sort_by: Literal['name', 'length'] = 'length',
ascending: bool = True
) -> None:
# Get data
data_dir = get_directory_path(name, directory_type)
if not os.path.exists(data_dir):
print(f"Directory not found: {data_dir}")
return
file_lengths = get_file_lengths(data_dir)
if not file_lengths:
print(f"No CSV files found in {data_dir}")
return
# Sort data
sorted_items = sorted(
file_lengths.items(),
key=lambda x: x[0] if sort_by == 'name' else x[1],
reverse=not ascending
)
files, lengths = zip(*sorted_items)
# Create and style plot
plt.figure(figsize=(15, 8))
bars = plt.bar(files, lengths)
if directory_type == "merged":
for bar in bars:
bar.set_color(get_color_by_length(bar.get_height()))
legend_elements = [
plt.Rectangle((0,0), 1, 1, facecolor=color, label=f'{label} rows')
for label, color in [('<20k', 'green'), ('20k-100k', 'orange'), ('>100k', 'red')]
]
plt.legend(handles=legend_elements)
plt.title(f'Number of Rows in Each CSV File - {name} Dataset ({directory_type})', pad=20)
plt.xlabel('File Names')
plt.ylabel('Number of Rows')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
# Save and display
os.makedirs('img', exist_ok=True)
plt.savefig(f'img/file_lengths_{name}_{directory_type}.png', bbox_inches='tight')
plt.show()
plot_len('train')
plot_len('test')
Subset¶
Create Directory¶
mkdir dataset/subset && cd dataset/subset
mkdir train && mkdir test
Subset Generation¶
This code implements our sampling strategy through random draws, using a fixed random seed for reproducibility.
For each file, it applies the rules presented above. The stratified approach maintains data representativeness while reducing volume, with random sampling ensuring unbiased selection within each range.
SUB_CONFIG = {
'RANDOM_SEED': 37,
'CAP': 20000,
'FIRST': 0.2,
'SECOND': 0.1,
'DIRS': {
'original': 'dataset/merged',
'subset': 'dataset/subset'
}
}
def calculate_subset(df: pd.DataFrame) -> pd.DataFrame:
"""Create subset based on dataframe size."""
total_rows = len(df)
if total_rows < SUB_CONFIG['CAP']:
return df
# Get first 20k random rows
first = df.sample(n=SUB_CONFIG['CAP'], random_state=SUB_CONFIG['RANDOM_SEED'])
if total_rows < 100000:
# For files between 20k and 100k
remaining = df.drop(first.index)
additional = remaining.sample(frac=SUB_CONFIG['FIRST'],
random_state=SUB_CONFIG['RANDOM_SEED'])
return pd.concat([first, additional])
# For files over 100k
next_80k = df.drop(first.index).sample(
n=(100000-SUB_CONFIG['CAP']),
random_state=SUB_CONFIG['RANDOM_SEED']
)
twenty_percent = next_80k.sample(
frac=0.2,
random_state=SUB_CONFIG['RANDOM_SEED']
)
remaining = df.drop(first.index).drop(next_80k.index)
ten_percent = remaining.sample(
frac=0.1,
random_state=SUB_CONFIG['RANDOM_SEED']
)
return pd.concat([first, twenty_percent, ten_percent])
def process_file(file_path: str, output_path: str) -> tuple:
"""Process single file and return statistics."""
df = pd.read_csv(file_path)
original_size = len(df)
if 'Benign' in file_path:
df_subset = df.copy()
else:
df_subset = calculate_subset(df)
df_subset.to_csv(output_path, index=False)
return original_size, len(df_subset)
def create_subset(name: str) -> None:
"""Create subset for given dataset."""
np.random.seed(SUB_CONFIG['RANDOM_SEED'])
# Setup directories
og_dir = os.path.join(SUB_CONFIG['DIRS']['original'], f"{name}")
sub_dir = os.path.join(SUB_CONFIG['DIRS']['subset'], name)
os.makedirs(sub_dir, exist_ok=True)
print(f"Processing directory: {og_dir}")
total_orig_size = 0
total_sub_size = 0
# Process files
for file in os.listdir(og_dir):
if file.endswith('.csv') and not re.search(r'\d', file):
orig_size, sub_size = process_file(
os.path.join(og_dir, file),
os.path.join(sub_dir, file)
)
total_orig_size += orig_size
total_sub_size += sub_size
overall_reduction = ((total_orig_size - total_sub_size) / total_orig_size * 100)
print(f"Overall reduction: {total_orig_size} -> {total_sub_size} ({overall_reduction:.2f}% reduction)")
create_subset('train')
create_subset('test')
Processing directory: dataset/merged/train
Overall reduction: 7160831 -> 1235465 (82.75% reduction)
Processing directory: dataset/merged/test
Overall reduction: 1614182 -> 465367 (71.17% reduction)
def plot_comparison(name: str) -> None:
# Get data for both directories
original_dir = get_directory_path(name, 'merged')
subset_dir = get_directory_path(name, 'subset')
if not all(os.path.exists(d) for d in [original_dir, subset_dir]):
print(f"One or both directories not found: {original_dir}, {subset_dir}")
return
# Get lengths for both datasets
original_lengths = get_file_lengths(original_dir)
subset_lengths = get_file_lengths(subset_dir)
if not all([original_lengths, subset_lengths]):
print("No CSV files found in one or both directories")
return
# Sort data by length
orig_items = sorted(original_lengths.items(), key=lambda x: x[1])
subset_items = sorted(subset_lengths.items(), key=lambda x: x[1])
orig_files, orig_lengths = zip(*orig_items)
subset_files, subset_lengths = zip(*subset_items)
# Create figure with adjusted size (less height needed without labels)
fig, ax1 = plt.subplots(figsize=(20, 8))
ax2 = ax1.twinx()
# Plot original dataset first (background)
bars1 = ax1.bar(range(len(orig_files)), orig_lengths,
label='Original Dataset', alpha=0.7,
color=[get_color_by_length(l) for l in orig_lengths],
zorder=1)
# Plot subset dataset second (foreground)
bars2 = ax2.bar(range(len(subset_files)), subset_lengths,
label='Processed Dataset', alpha=0.5, edgecolor='black',
color=[get_color_by_length(l) for l in subset_lengths],
zorder=2)
# Customize axes
ax1.set_ylabel('Number of Rows (Original)', color='darkblue', fontsize=12)
ax2.set_ylabel('Number of Rows (Processed)', color='darkred', fontsize=12)
# Remove x-axis labels completely
ax1.set_xticks([]) # Remove x-ticks
ax1.set_xticklabels([]) # Remove x-tick labels
# Add legends
color_legend = [
plt.Rectangle((0,0), 1, 1, facecolor=color, alpha=0.7, label=f'{label} rows')
for label, color in [('<20k', 'green'), ('20k-100k', 'orange'), ('>100k', 'red')]
]
dataset_legend = [
plt.Rectangle((0,0), 1, 1, facecolor='gray', alpha=0.7, label='Original'),
plt.Rectangle((0,0), 1, 1, facecolor='gray', alpha=0.5, label='Processed')
]
# Position legend
ax1.legend(handles=color_legend + dataset_legend,
loc='upper left')
# Add title
plt.title(f'Comparison of Row Counts - {name} Dataset',
pad=20,
fontsize=14)
# Adjust layout
plt.tight_layout()
# Save
plt.savefig(f'img/file_lengths_comparison_{name}.png',
bbox_inches='tight',
dpi=300)
plt.show()
plot_comparison("train")
plot_comparison("test")
Data Merging and Additional Feature Creation¶
After creating subsets, we merged the files into consolidated train and test datasets, adding three categorical features with increasing levels of specificity about the type of traffic:
- is_benign: malicious or not (boolean)
- category: attack classification (DDoS, DoS, etc.)
- attack: specific attack type
The mapping between file names and attack types was automated using regex pattern matching. This consolidation maintains attack classification information while simplifying subsequent analysis.
ATTACK_MAPPING: Dict[str, Tuple[int, str, str]] = {
r'^Benign': (1, 'BENIGN', 'Benign'),
r'^ARP_Spoofing': (0, 'SPOOFING', 'ARP Spoofing'),
r'^Recon-Ping_Sweep': (0, 'RECON', 'Ping Sweep'),
r'^Recon-VulScan': (0, 'RECON', 'Recon VulScan'),
r'^Recon-OS_Scan': (0, 'RECON', 'OS Scan'),
r'^Recon-Port_Scan': (0, 'RECON', 'Port Scan'),
r'^MQTT-Malformed_Data': (0, 'MQTT', 'Malformed Data'),
r'^MQTT-DoS-Connect_Flood': (0, 'MQTT', 'DoS Connect Flood'),
r'^MQTT-DDoS-Publish_Flood': (0, 'MQTT', 'DDoS Publish Flood'),
r'^MQTT-DoS-Publish_Flood': (0, 'MQTT', 'DoS Publish Flood'),
r'^MQTT-DDoS-Connect_Flood': (0, 'MQTT', 'DDoS Connect Flood'),
r'TCP_IP-DoS-TCP': (0, 'DoS', 'DoS TCP'),
r'TCP_IP-DoS-ICMP': (0, 'DoS', 'DoS ICMP'),
r'TCP_IP-DoS-SYN': (0, 'DoS', 'DoS SYN'),
r'TCP_IP-DoS-UDP': (0, 'DoS', 'DoS UDP'),
r'TCP_IP-DDoS-SYN': (0, 'DDoS', 'DDoS SYN'),
r'TCP_IP-DDoS-TCP': (0, 'DDoS', 'DDoS TCP'),
r'TCP_IP-DDoS-ICMP': (0, 'DDoS', 'DDoS ICMP'),
r'TCP_IP-DDoS-UDP': (0, 'DDoS', 'DDoS UDP')
}
def get_category_and_attack(filename: str) -> Tuple[int, str, str]:
"""Get category and attack type from filename."""
for pattern, (is_benign, category, attack) in ATTACK_MAPPING.items():
if re.match(pattern, filename):
return is_benign, category, attack
return -1, 'UNKNOWN', 'UNKNOWN'
def process_files_in_chunks(directory: str, output_file: str, chunk_size: int = 100000) -> None:
"""Process CSV files in chunks and write directly to output file."""
files = [f for f in os.listdir(directory) if f.endswith('.csv')]
print(f"Found {len(files)} CSV files in {directory}")
# Write header first
first_file = True
for filename in files:
filepath = os.path.join(directory, filename)
is_benign, category, attack = get_category_and_attack(filename)
# Process the file in chunks
for chunk in pd.read_csv(filepath, chunksize=chunk_size):
chunk['is_benign'] = is_benign
chunk['category'] = category
chunk['attack'] = attack
# Write to output file
chunk.to_csv(output_file,
mode='w' if first_file else 'a',
header=first_file,
index=False)
first_file = False
base_dir = 'dataset/subset'
for dataset in ['train', 'test']:
output_file = f'dataset/{dataset}_labeled.csv'
# Process files
process_files_in_chunks(
directory=os.path.join(base_dir, dataset),
output_file=output_file,
chunk_size=100000
)
Found 19 CSV files in dataset/subset/train
Found 19 CSV files in dataset/subset/test
train_df = pd.read_csv('dataset/train_labeled.csv')
test_df = pd.read_csv('dataset/test_labeled.csv')
# Get counts
train_categories = train_df['category'].value_counts()
test_categories = test_df['category'].value_counts()
train_attacks = train_df['attack'].value_counts()
test_attacks = test_df['attack'].value_counts()
# Create comparative tables
def create_comparison_table(train_counts, test_counts):
"""Create a comparison table with train and test counts."""
# Get all unique indices
all_indices = sorted(set(train_counts.index) | set(test_counts.index))
# Create table rows
table = [
[
index,
f"{train_counts.get(index, 0):,}",
f"{test_counts.get(index, 0):,}"
]
for index in all_indices
]
# Sort by total count (train + test)
table.sort(key=lambda x: (
float(x[1].replace(',', '')) + float(x[2].replace(',', ''))
), reverse=True)
return table
# Print total rows
print(f"\nTotal Rows:")
print(f"Train: {len(train_df):,}")
print(f"Test: {len(test_df):,}")
# Print category distribution
print("\nCategory Distribution:")
category_table = create_comparison_table(train_categories, test_categories)
print(tabulate(
category_table,
headers=['Category', 'Train Count', 'Test Count'],
tablefmt='psql'
))
# Print attack distribution
print("\nAttack Type Distribution:")
attack_table = create_comparison_table(train_attacks, test_attacks)
print(tabulate(
attack_table,
headers=['Attack Type', 'Train Count', 'Test Count'],
tablefmt='psql'
))
Total Rows:
Train: 1,235,465
Test: 465,367
Category Distribution:
+------------+---------------+--------------+
| Category   |   Train Count |   Test Count |
|------------+---------------+--------------|
| DDoS       |       581,986 |      210,677 |
| DoS        |       284,552 |      143,579 |
| BENIGN     |       192,732 |       37,607 |
| MQTT       |       107,607 |       46,182 |
| RECON      |        52,541 |       25,578 |
| SPOOFING   |        16,047 |        1,744 |
+------------+---------------+--------------+
Attack Type Distribution:
+--------------------+---------------+--------------+
| Attack Type        |   Train Count |   Test Count |
|--------------------+---------------+--------------|
| DDoS UDP           |       189,596 |       62,207 |
| DDoS ICMP          |       179,748 |       60,970 |
| Benign             |       192,732 |       37,607 |
| DDoS TCP           |       106,446 |       44,260 |
| DDoS SYN           |       106,196 |       43,240 |
| DoS UDP            |        82,695 |       39,755 |
| DoS SYN            |        70,190 |       35,719 |
| DoS ICMP           |        67,629 |       35,686 |
| DoS TCP            |        64,038 |       32,419 |
| DDoS Connect Flood |        43,304 |       24,383 |
| Port Scan          |        32,796 |       20,524 |
| DoS Publish Flood  |        24,875 |        8,505 |
| DDoS Publish Flood |        21,525 |        8,416 |
| OS Scan            |        16,832 |        3,834 |
| ARP Spoofing       |        16,047 |        1,744 |
| DoS Connect Flood  |        12,773 |        3,131 |
| Malformed Data     |         5,130 |        1,747 |
| Recon VulScan      |         2,173 |        1,034 |
| Ping Sweep         |           740 |          186 |
+--------------------+---------------+--------------+
2. ANALYSING AND SELECTING FEATURES¶
# Load the merged dataset
df = pd.read_csv('dataset/train_labeled.csv')
Checking for NULL values¶
# Count the number of features with null values
null_counts = df.isnull().sum()
features_with_nulls = null_counts[null_counts > 0]
# Create a DataFrame for display
null_features_df = pd.DataFrame({'Feature': features_with_nulls.index, 'Null Count': features_with_nulls.values})
# Print the table
print(tabulate(null_features_df, headers='keys', tablefmt='psql'))
+-----------+--------------+
| Feature   |   Null Count |
|-----------+--------------|
+-----------+--------------+
Good news: there are no features with NULL values.
Checking features type¶
# Print type and count of features
type_counts = df.dtypes.value_counts()
print(tabulate(pd.DataFrame({'Type': type_counts.index, 'Count': type_counts.values}), headers='keys', tablefmt='psql'))
+----+---------+---------+
|    | Type    |   Count |
|----+---------+---------|
|  0 | float64 |      45 |
|  1 | object  |       2 |
|  2 | int64   |       1 |
+----+---------+---------+
We can observe that all original features are float64; the remaining int64 and object columns are the labels we created (is_benign, category, attack).
Statistical Analysis¶
The following code performs statistical analysis on the merged dataset to uncover feature relationships and characteristics. Through correlation matrices, variance metrics, and zero/missing value analysis, it reveals redundant features and quality issues that inform feature selection and preprocessing decisions.
df_numeric = df.select_dtypes(include=['float64', 'int64'])
# Correlation Analysis
corr = df_numeric.corr(method = 'pearson')
# Create visualizations
plt.figure(figsize=(15, 10))
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Draw the heatmap with the mask and correct aspect ratio
cmap = sns.diverging_palette(12, 150, s=100, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5},
xticklabels=True, yticklabels=True)
plt.title('Feature Correlation Matrix ')
plt.tight_layout()
plt.savefig('img/correlation.png', bbox_inches='tight')
plt.show()
# Print correlations
print("\n--- Features with >90% correlation ---")
corr_matrix = corr
high_corr = np.where(np.abs(corr_matrix) >= 0.9)
high_corr_data = []
for i, j in zip(*high_corr):
if i < j: # avoid duplicate pairs
high_corr_data.append([
corr_matrix.index[i],
corr_matrix.columns[j],
corr_matrix.iloc[i, j]
])
print(tabulate(high_corr_data,
headers=["Feature 1", "Feature 2", "Correlation"],
tablefmt="grid",
floatfmt=".5f"))
--- Features with >90% correlation ---
+---------------+-------------+---------------+
| Feature 1     | Feature 2   |   Correlation |
+===============+=============+===============+
| Protocol Type | UDP         |       0.93744 |
| Rate          | Srate       |       1.00000 |
| ARP           | IPv         |      -1.00000 |
| ARP           | LLC         |      -1.00000 |
| IPv           | LLC         |       1.00000 |
| Tot sum       | AVG         |       0.90472 |
| Max           | Magnitue    |       0.93071 |
| AVG           | Tot size    |       0.97746 |
| AVG           | Magnitue    |       0.97982 |
| Std           | Radius      |       0.99997 |
| Std           | Covariance  |       0.95656 |
| Tot size      | Magnitue    |       0.95790 |
| IAT           | Number      |       0.99914 |
| IAT           | Weight      |       0.99933 |
| Number        | Weight      |       0.99995 |
| Radius        | Covariance  |       0.95657 |
+---------------+-------------+---------------+
# Feature Variance
variance = df_numeric.var().sort_values(ascending=False)
# Print variance
print("\n--- Top 10 High Variance Features ---")
variance_data = [[feature, value] for feature, value in
variance.head(10).items()]
print(tabulate(variance_data,
headers=["Feature", "Variance"],
tablefmt="grid",
floatfmt=".5f"))
--- Top 10 High Variance Features ---
+---------------+------------------------+
| Feature       |               Variance |
+===============+========================+
| IAT           | 1549362499752095.75000 |
| Header_Length |     405017979418.82312 |
| Covariance    |       1661236197.27687 |
| Srate         |       1197455513.77209 |
| Rate          |       1197455513.77209 |
| Tot sum       |          4509439.03181 |
| rst_count     |          1378529.44909 |
| Max           |            74750.58662 |
| AVG           |            34443.39249 |
| Tot size      |            34366.50876 |
+---------------+------------------------+
# Zero Values Percentage
zero_percent = (df_numeric == 0).sum() / len(df_numeric) * 100
# Print features with high zero percentage
print("\n--- Top 10 Zero Value Percentages ---")
zero_data = [[feature, value] for feature, value in
zero_percent.sort_values(ascending=False).head(10).items()]
print(tabulate(zero_data,
headers=["Feature", "Zero %"],
tablefmt="grid",
floatfmt=".5f"))
--- Top 10 Zero Value Percentages ---
+-----------------+-----------+
| Feature         |    Zero % |
+=================+===========+
| Drate           | 100.00000 |
| DHCP            |  99.99409 |
| IGMP            |  99.98867 |
| cwr_flag_number |  99.98810 |
| ece_flag_number |  99.98462 |
| IRC             |  99.93128 |
| SMTP            |  99.93055 |
| Telnet          |  99.92990 |
| SSH             |  99.92788 |
| HTTP            |  99.46749 |
+-----------------+-----------+
Selection Method 1: Manual Selection¶
Both the correlation matrix and the zero-value percentages show that the Drate column is constant (100% zeros), so we drop it.
df=df.drop('Drate', axis=1)
This was the only trivial choice. However, further selection is needed: we can see that there is high correlation within data, and this can negatively affect the quality of our models. For this reason, we want to keep only non-redundant features to maximise the discriminatory power of the variables.
We based our first approach on the simple observation of the output of the previous analysis. Through a combination of correlation and variance, we decided to drop the following columns:
to_drop = ["Srate", "Protocol Type", "IPv", "LLC", "Tot sum", "Tot size", "AVG", "Max", "Number", "Weight"]
df_sel_manual = df.drop(columns=to_drop)
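The same pruning can also be automated: scan the upper triangle of the correlation matrix and drop one feature from every pair above a threshold (a sketch on toy data, not the selection actually used here; `drop_correlated` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair correlated above `threshold`."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop)

# Toy data: 'b' is a scaled copy of 'a', 'c' is independent noise
rng = np.random.default_rng(37)
toy = pd.DataFrame({'a': rng.normal(size=100)})
toy['b'] = 2 * toy['a']
toy['c'] = rng.normal(size=100)
print(drop_correlated(toy).columns.tolist())  # ['a', 'c']
```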
As a starting point, in order to evaluate our dataset and choices, we set up a first Classification model. We want to predict if a packet flow is benign or malicious using a Random Forest. Since we have an abundance of data, we split the training dataset into Train and Validation sets, to avoid overfitting the test dataset in this feature selection phase.
# Splitting the Dataset in Training and Validation sets
X_train, X_val, y_train, y_val = train_test_split(
df_sel_manual.drop(['is_benign', 'category', 'attack'], axis=1),
df_sel_manual['is_benign'], train_size=0.7, random_state=37
)
# Training the RandomForest
rnd_forest = RandomForestClassifier(random_state=37)
rnd_forest.fit(X_train, y_train);
dump(rnd_forest, 'models/random_forest_sel_man.joblib');
A useful metric that the RandomForest ensemble provides is the feature importance (Gini importance, i.e. mean decrease in impurity). This can be used to further refine our feature selection.
feature_importances_dict = dict(zip(X_train.columns, rnd_forest.feature_importances_))
# Sort features by importance (descending order)
sorted_dict = dict(sorted(feature_importances_dict.items(), key=lambda x: x[1], reverse=True))
# Create plot
plt.figure(figsize=(10, 6))
sns.barplot(
x=list(sorted_dict.values()),
y=list(sorted_dict.keys()),
hue=list(sorted_dict.keys()), # Add this line
palette="Blues_r"
)
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Features", fontsize=12)
plt.title("Feature Importances", fontsize=14, fontweight="bold")
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Evaluating the performance on the validation set, using common metrics:
def compute_metrics(y_true, y_pred):
"""
Print comprehensive model evaluation metrics with both rich text output and seaborn heatmap.
Parameters:
-----------
y_true : array-like
True labels
y_pred : array-like
Predicted labels
"""
# Calculate core metrics
metrics = {
"Accuracy": accuracy_score(y_true, y_pred),
"Precision": precision_score(y_true, y_pred),
"Recall": recall_score(y_true, y_pred),
"F1 Score": f1_score(y_true, y_pred)
}
# Create metrics table
table_data = [[metric, f"{value:.5f}"] for metric, value in metrics.items()]
table = tabulate(table_data, headers=["Metric", "Score"], tablefmt="psql")
print(table)
return metrics
y_pred = rnd_forest.predict(X_val)
with open('models/random_forest_sel_man.json', 'w') as f:
json.dump(compute_metrics(y_val, y_pred), f);
+-----------+---------+
| Metric    | Score   |
|-----------+---------|
| Accuracy  | 0.99667 |
| Precision | 0.98413 |
| Recall    | 0.99477 |
| F1 Score  | 0.98942 |
+-----------+---------+
Selection Method 2: Hierarchical Clustering Selection¶
The results are good, but now let's try selecting the features with a more robust and reliable approach. This time, beyond the correlation calculation (same as before), we compute a hierarchical clustering to group the most correlated features, using Euclidean distance and complete linkage.
# Parameters
corr_method = 'pearson'
link_method = 'complete'
dist_metric = 'euclidean'
# Compute correlation matrix, excluding categorical columns
corr = df.drop(['is_benign', 'attack', 'category'], axis=1).corr(method=corr_method)
cmap = sns.diverging_palette(12, 150, s=100, as_cmap=True)
# Create initial clustermap to get linkage matrix
g = sns.clustermap(
corr,
annot=False,
cbar=True,
cmap=cmap,
center=0,
linewidths=.5,
cbar_kws={"shrink": .5},
metric=dist_metric,
method=link_method,
xticklabels=True,
yticklabels=True
)
g.figure.suptitle(
f"Clustermap (ClusterMethod: {link_method}, DistanceMetric: {dist_metric}, CorrelationMethod: {corr_method})",
y=1.05
)
# Create detailed visualization with dendrogram and correlation matrix
linkage_matrix = g.dendrogram_col.linkage
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
# Left plot: Dendrogram
dendrogram(
linkage_matrix,
ax=ax1,
labels=corr.columns
)
ax1.set_title("Hierarchical Clustering Dendrogram")
# Right plot: Reordered correlation matrix
dendro_idx = g.dendrogram_col.reordered_ind
reordered_corr = corr.iloc[dendro_idx, dendro_idx]
sns.heatmap(
reordered_corr,
annot=False,
cmap=cmap,
cbar=True,
ax=ax2,
cbar_kws={"shrink": .5},
linewidths=0.5,
square=True,
xticklabels=True,
yticklabels=True
)
ax2.set_title("Correlation Matrix")
# Final layout adjustments
fig.suptitle(
f"Clustermap (ClusterMethod: {link_method}, DistanceMetric: {dist_metric}, CorrelationMethod: {corr_method})",
y=1.05
)
plt.tight_layout()
plt.show()
The only manual decision in this step is the distance threshold, chosen by visually inspecting the dendrogram. The rest is automated: for each cluster formed at the chosen threshold, the script keeps the feature with the highest (Gini-based) importance.
threshold = 1.0
# Group features by cluster
clusters = fcluster(linkage_matrix, t=threshold, criterion='distance')
clustered_features = {}
for idx, cluster in enumerate(clusters):
if cluster not in clustered_features:
clustered_features[cluster] = []
clustered_features[cluster].append(corr.columns[idx])
# Print clusters in a clean, sorted format
print("Clustered Features:\n" + "-"*20)
# Sort by cluster number for better readability
for cluster in sorted(clustered_features.keys()):
features = clustered_features[cluster]
# Join features with commas and proper spacing
feature_list = ", ".join(features)
print(f"Cluster {cluster:2d}: {feature_list}")
Clustered Features:
--------------------
Cluster 1: Rate, Srate
Cluster 2: Protocol Type, UDP
Cluster 3: IPv, LLC
Cluster 4: ICMP
Cluster 5: IAT, Number, Weight
Cluster 6: ece_flag_number, cwr_flag_number
Cluster 7: Telnet, SMTP, IRC
Cluster 8: SSH
Cluster 9: DHCP
Cluster 10: IGMP
Cluster 11: DNS
Cluster 12: Duration
Cluster 13: HTTP
Cluster 14: syn_flag_number
Cluster 15: syn_count
Cluster 16: fin_flag_number, ack_count
Cluster 17: rst_flag_number
Cluster 18: ARP
Cluster 19: psh_flag_number, ack_flag_number, Variance
Cluster 20: rst_count
Cluster 21: TCP
Cluster 22: Std, Radius, Covariance
Cluster 23: fin_count
Cluster 24: Tot sum, Max, AVG, Tot size, Magnitue
Cluster 25: Header_Length, HTTPS
Cluster 26: Min
# Prepare data for Random Forest
X = df.drop(['is_benign', 'category', 'attack'], axis=1)
y = df['is_benign']
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7, random_state=37)
# Train Random Forest and get feature importances
rnd_forest.fit(X_train, y_train)
feature_importances = dict(zip(X_train.columns, rnd_forest.feature_importances_))
# Sort features by importance
sorted_features = sorted(
feature_importances.items(),
key=lambda x: x[1],
reverse=True
)
# Select most important feature from each cluster
selected_features = []
for cluster, features in clustered_features.items():
# Get importance scores for features in this cluster
cluster_importances = {
feature: feature_importances[feature]
for feature in features
}
# Select feature with highest importance
best_feature = max(cluster_importances, key=cluster_importances.get)
selected_features.append(best_feature)
print("\nSelected features from each cluster:")
for feature in selected_features:
print(f" - {feature}")
Selected features from each cluster:
- Header_Length
- UDP
- Duration
- Rate
- ack_count
- syn_flag_number
- rst_flag_number
- Variance
- ece_flag_number
- syn_count
- fin_count
- rst_count
- HTTP
- DNS
- SMTP
- SSH
- TCP
- DHCP
- ARP
- ICMP
- IGMP
- IPv
- AVG
- Min
- Covariance
- IAT
Let's train a new RandomForest with the dataset containing the selected features:
selected_features.extend(['is_benign', 'attack', 'category'])
df_sel_hclust = df[selected_features]
X_train, X_val, y_train, y_val = train_test_split(
df_sel_hclust.drop(['is_benign', 'category', 'attack'], axis=1),
df_sel_hclust['is_benign'], train_size=0.7, random_state=37
)
rnd_forest.fit(X_train, y_train);
dump(rnd_forest, 'models/random_forest_sel_hclust.joblib');
feature_importances_dict = dict(zip(X_train.columns, rnd_forest.feature_importances_))
# Sort features by importance (descending order)
sorted_dict = dict(sorted(feature_importances_dict.items(), key=lambda x: x[1], reverse=True))
# Create plot
plt.figure(figsize=(10, 6))
sns.barplot(
x=list(sorted_dict.values()),
y=list(sorted_dict.keys()),
hue=list(sorted_dict.keys()),
palette="Blues_r"
)
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Features", fontsize=12)
plt.title("Feature Importances", fontsize=14, fontweight="bold")
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Let's test the performance of the new approach used for feature selection:
y_pred = rnd_forest.predict(X_val)
with open('models/random_forest_sel_hclust.json', 'w') as f:
json.dump(compute_metrics(y_val, y_pred), f);
+-----------+---------+
| Metric    | Score   |
|-----------+---------|
| Accuracy  | 0.99661 |
| Precision | 0.98381 |
| Recall    | 0.99472 |
| F1 Score  | 0.98924 |
+-----------+---------+
# Load the performance metrics
with open('models/random_forest_sel_hclust.json', 'r') as f:
metrics_sel_hclust = json.load(f)
# Load the performance metrics
with open('models/random_forest_sel_man.json', 'r') as f:
metrics_sel_manual = json.load(f)
# Metric names
metric_names = ["Accuracy", "Precision", "Recall", "F1 Score"]
# Combine into a table
table = zip(metric_names, [metrics_sel_hclust[m] for m in metric_names], [metrics_sel_manual[m] for m in metric_names])
# Print the table
print(tabulate(table, headers=["Metric", "HClust Selection", "Manual Selection"], floatfmt=".3f"))
Metric       HClust Selection    Manual Selection
---------  ------------------  ------------------
Accuracy                0.997               0.997
Precision               0.984               0.984
Recall                  0.995               0.995
F1 Score                0.989               0.989
These results validate both selection methods and confirm the strong performance of the Random Forest model.
Let's save the new dataset with the selected features:
df_sel_hclust.to_csv('dataset/train_sel_hclust.csv', index=False)
3. FEATURES VISUALISATION¶
Let's compare the distribution of the attacks with the benign data flows, using the top 5 most important features:
df_sel = pd.read_csv('dataset/train_sel_hclust.csv')
def plot_binary_distributions(df, features):
# Apply Min-Max scaling
scaler = MinMaxScaler()
df[features] = scaler.fit_transform(df[features])
df_copy = df.sample(frac=0.1, random_state=37)
plt.figure(figsize=(20, 10))
for i, feature in enumerate(features):
plt.subplot(2, (len(features) + 1) // 2, i + 1)
sns.kdeplot(data=df_copy, x=feature, hue='is_benign',
hue_order=[0, 1],
fill=True, common_norm=False,
palette={0: 'red', 1: 'green'}) # Fixed colors for benign and malicious
plt.title(f'Benign vs Malicious - {feature}')
plt.xlabel(feature)
plt.legend(title='NATURE', labels=['benign', 'malicious'])
plt.tight_layout()
plt.show()
def plot_attack_distributions(df, feature):
# Apply Min-Max scaling
scaler = MinMaxScaler()
df[feature] = scaler.fit_transform(df[[feature]])
plt.figure(figsize=(20, 10))
attacks = ['SPOOFING', 'RECON', 'MQTT', 'DoS', 'DDoS']
for attack in attacks:
data = df[df['category'].isin([attack, 'BENIGN'])]
plt.subplot(2, 3, attacks.index(attack) + 1)
sns.kdeplot(data=data, x=feature, hue='category',
hue_order=['BENIGN', attack],
fill=True, common_norm=False,
palette={attack: 'red', 'BENIGN': 'green'})
plt.title(f'BENIGN vs {attack}')
plt.xlabel(feature)
plt.tight_layout()
plt.show()
feature_importances_dict = dict(
zip(df_sel.drop(['is_benign', 'category', 'attack'], axis=1)
, rnd_forest.feature_importances_))
# Sort features by importance and get top 5 feature names
top_5_features = [feature[0] for feature in sorted(
feature_importances_dict.items(),
key=lambda x: x[1],
reverse=True
)[:5]]
Benign vs Malicious¶
plot_binary_distributions(df_sel.copy(), top_5_features)
Benign vs Attack Category¶
for feature_name in top_5_features:
plot_attack_distributions(df_sel.copy(), feature_name)
rst_count (count of RST flag occurrences in packets): the graphs show that its values are concentrated near zero for attacks, regardless of the specific category, while the benign distribution is wider and shifted towards higher values. It is therefore a useful indicator for distinguishing the two classes.
Header_Length (mean of the transport-layer header lengths): except for spoofing, which shows a wider distribution of values, Header_Length is discriminative between the two phenomena; the separation is less noticeable for MQTT.
Variance (variance of the packet lengths in the window): the distinction between the two classes is most marked for RECON, DoS and DDoS, where benign traffic is characterized by higher values than attack-related data.
IAT (mean interval between the current and previous packet in the window): IAT does not appear to be a strong indicator for differentiating attacks like SPOOFING and RECON, but it certainly is for MQTT, DoS and DDoS, where the separation between malicious and benign traffic is very clear. In general, its values are concentrated around 0, 0.5 and 1.
RATE: the distributions for SPOOFING and RECON are almost inverted compared with the other attack categories.
DoS and DDoS: it is not surprising that these two types of attack have similar distributions across all features.
4. SUPERVISED LEARNING¶
4.1 BINARY CLASSIFICATION¶
Data preprocessing¶
In order to have a balanced dataset for the binary decision, we decided to maintain all benign flows and randomly drop malicious samples so that both classes have the same number of entries.
df = pd.read_csv('dataset/train_sel_hclust.csv')
def balance_dataset(df: pd.DataFrame, target_column: str = 'is_benign') -> pd.DataFrame:
# Separate the dataset into two groups
df_ones = df[df[target_column] == 1]
df_zeros = df[df[target_column] == 0]
# Determine the number of samples to keep from the zeros group
num_ones = len(df_ones)
df_zeros_balanced = df_zeros.sample(n=num_ones, random_state=37)
# Concatenate the balanced zeros with the ones
df_balanced = pd.concat([df_ones, df_zeros_balanced], ignore_index=True)
# Shuffle the dataset
df_balanced = df_balanced.sample(frac=1, random_state=37).reset_index(drop=True)
df_balanced.to_csv('dataset/train_binary.csv', index=False)
return df_balanced
# Balance the dataset
df = balance_dataset(df)
# Display the distribution of the target column
print(df['is_benign'].value_counts())
is_benign
1    192732
0    192732
Name: count, dtype: int64
Train-Validation split:
# Separate features and target
X = df.drop(['is_benign', 'category', 'attack'], axis=1)
y = df['is_benign']
# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7, random_state=37)
We defined the following function to calculate the chosen metrics for each model and plot the confusion matrix. We used multiple metrics to fully assess each model's effectiveness in detecting and classifying cyber threats:
- Accuracy measures the overall correctness of the model by assessing the proportion of correctly classified instances.
- Precision ensures the model's reliability in identifying true positives without excessive false alarms.
- Recall (also called sensitivity or true positive rate) measures the model's ability to detect all relevant instances, especially critical in cybersecurity to avoid overlooking threats.
- F1-score is the harmonic mean of precision and recall, balancing the two.
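To make the definitions above concrete, here is a small worked example (toy labels, not our data) computing the four metrics from the confusion-matrix counts and checking them against scikit-learn:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives:  3
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives: 1
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives: 1
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives:  3

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```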
def evaluate_model(y_true, y_pred, model_name="Model", class_names=None):
"""
Print comprehensive model evaluation metrics with both rich text output and seaborn heatmap.
Parameters:
-----------
y_true : array-like
True labels
y_pred : array-like
Predicted labels
model_name : str, optional
Name of the model for display purposes
class_names : list, optional
List of class names for axis labels
"""
# Calculate core metrics
metrics = {
"Accuracy": accuracy_score(y_true, y_pred),
"Precision": precision_score(y_true, y_pred),
"Recall": recall_score(y_true, y_pred),
"F1 Score": f1_score(y_true, y_pred)
}
# Create metrics table
table_data = [[metric, f"{value:.5f}"] for metric, value in metrics.items()]
table = tabulate(table_data, headers=["Metric", "Score"], tablefmt="psql")
print(table)
# Calculate and plot confusion matrix as heatmap
cm = confusion_matrix(y_true, y_pred)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
# Create heatmap with enhanced annotations
plt.figure(figsize=(18, 10))
if class_names is None:
class_names = [f"Class {i}" for i in range(cm.shape[0])]
sns.heatmap(
cm_normalized,
annot=True,
fmt='.2f',
cmap='Blues',
xticklabels=class_names,
yticklabels=class_names,
annot_kws={'size': 16, 'weight': 'bold'} # Increased font size and made bold
)
# Adjust title and label properties
plt.title(f'{model_name} - Normalized Confusion Matrix', fontsize=16, pad=20)
plt.xlabel('Predicted', fontsize=14, weight='bold')
plt.ylabel('True', fontsize=14, weight='bold')
# Adjust tick labels size
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()
return metrics
LOGISTIC REGRESSION¶
We started classifying with a linear method.
Hyperparameters Optimisation
Optimising the regularization parameter C (the inverse of regularization strength) using grid search:
# Define LogisticRegression and parameter grid
log_reg = LogisticRegression()
param_grid = {
'penalty': ['l2'],
'C': np.linspace(0.1, 4.0, 20),
'max_iter': [10000],
'random_state': [37]
}
# Set up StratifiedKFold with random_state for reproducibility
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)
# Set up GridSearchCV
search = GridSearchCV(
log_reg,
param_grid,
scoring=['accuracy', 'precision', 'recall', 'f1'],
refit='accuracy',
cv=cv,
verbose=True
)
TRAINING
# Start the time
start_time = time.time()
# Fit the grid search
search.fit(X_train, y_train);
# Stop the timer and print the result
elapsed_time = time.time() - start_time
print(f"Elapsed time {elapsed_time:.2f} seconds")
best_model = search.best_estimator_
best_params = search.best_params_
# Print the best parameters in a table
param_table = [[param, value] for param, value in best_params.items()]
print("Best Parameters by Average Score:")
print(tabulate(param_table, headers=["Parameter", "Value"], tablefmt="psql"))
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Elapsed time 365.35 seconds
Best Parameters by Average Score:
+--------------+--------------------+
| Parameter    | Value              |
|--------------+--------------------|
| C            | 3.3842105263157896 |
| max_iter     | 10000              |
| penalty      | l2                 |
| random_state | 37                 |
+--------------+--------------------+
def plot_hyperparameter(results):
# Plot the metrics against the parameter `C`
param_values = results['param_C'].data.astype(float)
metrics = ['mean_test_accuracy', 'mean_test_precision', 'mean_test_recall', 'mean_test_f1']
# Create a figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Define the zoom region (e.g., rightmost 30% of the data)
zoom_start = int(0.7 * len(param_values))
# Dictionary to store peak locations
peak_locations = {}
# Left subplot - full view
for metric in metrics:
values = results[metric]
line = ax1.plot(param_values, values, label=metric, linewidth=2)
color = line[0].get_color()
# Find the peak in the zoomed region
zoom_values = values[zoom_start:]
zoom_params = param_values[zoom_start:]
peak_idx = zoom_start + np.argmax(zoom_values)
peak_value = values[peak_idx]
peak_param = param_values[peak_idx]
peak_locations[metric] = (peak_param, peak_value)
# Add vertical line at peak location in full view
ax1.axvline(x=peak_param, color=color, linestyle='--', alpha=0.5)
ax1.set_xlabel('C (Inverse Regularization Strength)')
ax1.set_ylabel('Score')
ax1.set_title('Full Range View')
ax1.legend(loc='upper left')
ax1.grid(True)
# Right subplot - zoomed view
for metric in metrics:
values = results[metric]
line = ax2.plot(param_values, values, label=metric, linewidth=2)
color = line[0].get_color()
# Get the peak information
peak_param, peak_value = peak_locations[metric]
# Add a marker at the peak
ax2.plot(peak_param, peak_value, 'o', markersize=8,
color=color, markeredgecolor='black', markeredgewidth=2)
# Add annotation for the peak value
ax2.annotate(f'{peak_value:.3f}',
xy=(peak_param, peak_value),
xytext=(10, 5),
textcoords='offset points',
ha='left',
va='bottom',
bbox=dict(boxstyle='round,pad=0.5', fc='white', alpha=0.8),
arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))
ax2.set_xlabel('C (Inverse Regularization Strength)')
ax2.set_title('Zoomed View of Peak Region')
ax2.grid(True)
# Set the x-axis limits for the zoomed view
ax2.set_xlim(param_values[zoom_start], param_values[-1])
# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()
# Plot the hyperparameter results
plot_hyperparameter(search.cv_results_)
VALIDATION
y_pred = best_model.predict(X_val)
# Evaluate the model
_ = evaluate_model(y_val, y_pred, "Logistic Regression", ["Benign", "Malicious"])
+-----------+---------+
| Metric    | Score   |
|-----------+---------|
| Accuracy  | 0.91639 |
| Precision | 0.92033 |
| Recall    | 0.91639 |
| F1 Score  | 0.91618 |
+-----------+---------+
dump(best_model, 'models/log_reg_binary.joblib');
RANDOM FOREST¶
After the logistic regression, we implemented an ensemble learning method based on decision trees.
Choosing the number of trees to use in the ensemble
In a Random Forest, each tree is trained on a subset of the training data selected by bootstrap sampling. The "out-of-bag" (OOB) data is the portion of the training data that was not selected during the bootstrap sampling. The OOB error is a measure of the model's predictive performance.
Observing how the OOB estimate varies with the number of trees helps us understand after how many trees the model stabilizes, i.e. the optimal ensemble size. The point where the OOB error stops improving or fluctuating significantly indicates that the model generalizes well to unseen data, and that adding more trees does not meaningfully improve performance.
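The OOB mechanism rests on a simple fact: with a bootstrap sample of size n drawn from n rows, each row is left out of a given tree with probability (1 - 1/n)^n ≈ e⁻¹ ≈ 0.368, so roughly a third of the training data acts as free validation data for every tree. A quick empirical check of this fraction:

```python
import numpy as np

rng = np.random.default_rng(37)
n = 10_000
fractions = []
for _ in range(50):
    boot = rng.integers(0, n, size=n)   # one bootstrap sample (with replacement)
    oob = n - len(np.unique(boot))      # rows never drawn -> out-of-bag for this tree
    fractions.append(oob / n)

mean_oob = np.mean(fractions)           # close to 1/e ≈ 0.368
```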
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
def calculate_oob_errors(X_train, y_train, n_trees_values, random_state=37):
"""Calculate OOB errors for a range of tree counts in RandomForest."""
oob_errors = []
for n_trees in n_trees_values:
temp_forest = RandomForestClassifier(n_estimators=n_trees, oob_score=True, random_state=random_state)
temp_forest.fit(X_train, y_train)
oob_errors.append(1 - temp_forest.oob_score_)
return oob_errors
# Define range of tree counts to evaluate
n_trees_values = range(1, 100, 5)
# Calculate OOB errors
oob_errors = calculate_oob_errors(X_train, y_train, n_trees_values)
# Plot the OOB error
plt.figure(figsize=(20, 12))
plt.plot(n_trees_values, oob_errors, marker='o', linestyle='-')
plt.xlabel("Number of Trees")
plt.ylabel("OOB Error")
plt.title("Out-of-Bag Error vs Number of Trees")
plt.grid(True)
plt.show()
Optimising Hyperparameters
We used a grid search, with 3-fold cross-validation, to choose the best function for measuring the quality of a split ('criterion') and the best maximum depth of the trees in our ensemble ('max_depth').
We also defined a custom scorer function to define what we intend as "best model": the one that maximises the average of our evaluation metrics.
def average_score(y_true, y_pred):
return (accuracy_score(y_true, y_pred) +
precision_score(y_true, y_pred) +
recall_score(y_true, y_pred) +
f1_score(y_true, y_pred)) / 4
average_scorer = make_scorer(average_score)
# Define RandomForestClassifier and parameter grid
rnd_forest = RandomForestClassifier()
param_grid = {
'criterion': ['gini', 'entropy', 'log_loss'],
'max_depth': [2, 5, 10, 15, 20, 25, None],
'random_state': [37]
}
# Set up StratifiedKFold with random_state for reproducibility
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)
# Set up GridSearchCV
search = GridSearchCV(
rnd_forest,
param_grid,
scoring=average_scorer, # Pass the custom scorer
refit=True,
cv=cv,
verbose=True,
)
TRAINING
start_time = time.time()
# Fit the grid search
search.fit(X_train, y_train);
elapsed_time = time.time() - start_time
print(f"Elapsed time {elapsed_time:.2f} seconds")
# Access the best model and parameters
best_model = search.best_estimator_
best_params = search.best_params_
# Print the best parameters in a table
param_table = [[param, value] for param, value in best_params.items()]
print("Best Parameters by Average Score:")
print(tabulate(param_table, headers=["Parameter", "Value"], tablefmt="psql"))
Fitting 3 folds for each of 21 candidates, totalling 63 fits
Elapsed time 972.89 seconds
Best Parameters by Average Score:
+--------------+---------+
| Parameter | Value |
|--------------+---------|
| criterion | entropy |
| max_depth | |
| random_state | 37 |
+--------------+---------+
VALIDATION
y_pred = best_model.predict(X_val)
# Evaluate the model
_ = evaluate_model(y_val, y_pred, "Random Forest", ["Benign", "Malicious"])
+-----------+---------+
| Metric    | Score   |
|-----------+---------|
| Accuracy  | 0.99707 |
| Precision | 0.99708 |
| Recall    | 0.99707 |
| F1 Score  | 0.99707 |
+-----------+---------+
dump(best_model, 'models/rnd_forest_binary.joblib');
XGBOOST¶
In this section the aim is to evaluate another decision-tree ensemble, given the strong results obtained with the Random Forest.
SETTING UP THE MODEL
average_scorer = make_scorer(average_score)
# Initial model
xgboost = xgb.XGBClassifier(
objective='binary:logistic',
scale_pos_weight=len(y_train[y_train==0]) / len(y_train[y_train==1]),
random_state=37
)
# Parameter grid
param_grid = {
'max_depth': [3, 5, 7],
'min_child_weight': [1, 3, 5],
'gamma': [0, 0.1, 0.2],
'subsample': [0.8, 0.9],
'colsample_bytree': [0.8, 0.9]
}
# Set up StratifiedKFold with random_state for reproducibility
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)
# Perform grid search
search = GridSearchCV(
xgboost,
param_grid,
cv=cv,
scoring=average_scorer,
n_jobs=-1,
verbose=1,
return_train_score=True
)
TRAINING
start_time = time.time()
# Fit the grid search
search.fit(X_train, y_train)
elapsed_time = time.time() - start_time
print(f"Elapsed time {elapsed_time:.2f} seconds")
# Access the best model and parameters
best_model = search.best_estimator_
best_params = search.best_params_
# Print the best parameters in a table
param_table = [[param, value] for param, value in best_params.items()]
print("Best Parameters by Average Score:")
print(tabulate(param_table, headers=["Parameter", "Value"], tablefmt="psql"))
Fitting 3 folds for each of 108 candidates, totalling 324 fits
Elapsed time 193.84 seconds
Best Parameters by Average Score:
+------------------+---------+
| Parameter        | Value   |
|------------------+---------|
| colsample_bytree | 0.9     |
| gamma            | 0.2     |
| max_depth        | 7       |
| min_child_weight | 1       |
| subsample        | 0.9     |
+------------------+---------+
VALIDATION
y_pred = best_model.predict(X_val)
# Evaluate the model
_ = evaluate_model(y_val, y_pred, "XGBoost", ["Benign", "Malicious"])
+-----------+---------+
| Metric    | Score   |
|-----------+---------|
| Accuracy  | 0.99697 |
| Precision | 0.99698 |
| Recall    | 0.99697 |
| F1 Score  | 0.99697 |
+-----------+---------+
dump(best_model, 'models/xgboost_binary.joblib');
MODELS COMPARISON¶
LOADING MODELS
log_reg = load('models/log_reg_binary.joblib')
rnd_frs = load('models/rnd_forest_binary.joblib')
xgboost = load('models/xgboost_binary.joblib')
models = {
'Logistic Regression': log_reg,
'Random Forest': rnd_frs,
'XGBoost': xgboost
}
# Load the test dataset
df = pd.read_csv('dataset/train_sel_hclust.csv')
# Separate features and target
X = df.drop(['is_benign', 'category', 'attack'], axis=1)
y = df['is_benign']
# Split the data
_, X_val, _, y_val = train_test_split(X, y, train_size=0.7, random_state=37)
CALIBRATION¶
The calibration plot shown below compares the probabilities predicted by our models with the real observed frequency of positive values.
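A calibration curve bins the predicted probabilities and compares, per bin, the mean predicted probability with the observed fraction of positives. As a sanity check on synthetic data (a score that is perfectly calibrated by construction, not one of our models), the two quantities should track each other closely:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(37)
y_prob = rng.uniform(size=5000)
# Labels drawn so that P(y=1 | p) = p -> a perfectly calibrated score
y_true = (rng.uniform(size=5000) < y_prob).astype(int)

# Per-bin observed fraction of positives vs mean predicted probability
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
```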
fig, ax_calibration_curve = plt.subplots(figsize=(10, 5))
for name, clf in models.items():
CalibrationDisplay.from_estimator(
clf,
X_val,
y_val,
name=name,
ax=ax_calibration_curve
)
ax_calibration_curve.set_xlim([0, 1])
ax_calibration_curve.set_ylim([0, 1])
ax_calibration_curve.grid()
ax_calibration_curve.set_title("Calibration plots")
plt.show()
We can see that the blue line of Logistic Regression is closest to the line representing perfect calibration (where the predicted probability exactly matches the observed frequency), and it is thus the best-calibrated model.
ROC-AUC¶
Plotting the ROC curve, and calculating the Area Under the Curve (AUC), helps us understand the sensitivity and specificity of the models by plotting the true positive rate against the false positive rate (false alarms).
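The AUC also has a useful probabilistic reading: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties counting one half). A toy check of this equivalence against scikit-learn, on hypothetical scores rather than our models:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Compare every (positive, negative) pair of scores; ties count one half
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auc_manual = float(np.mean(pairs))

auc_sklearn = roc_auc_score(y_true, y_score)
```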
def plot_roc_curve(models, X_val, y_val):
# Create a figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
# Plot in both subplots
for name, model in models.items():
y_proba = model.predict_proba(X_val)
fpr, tpr, _ = roc_curve(y_val, y_proba[:, 1])
auc = roc_auc_score(y_val, y_proba[:, 1])
# Plot on the left subplot (full range)
ax1.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})', linewidth=2)
# Plot on the right subplot (zoomed)
ax2.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})', linewidth=2)
# Configure left subplot (full range)
ax1.plot([0, 1], [0, 1], 'k:', label='Random classifier')
ax1.set_xlim([-0.01, 1.01])
ax1.set_ylim([-0.01, 1.01])
ax1.grid(True, linestyle='--', alpha=0.4)
ax1.set_title('ROC Curve (Full Range)', fontsize=14)
ax1.set_xlabel('False Positive Rate', fontsize=12)
ax1.set_ylabel('True Positive Rate', fontsize=12)
ax1.legend(loc='lower right', fontsize=10)
# Configure right subplot (zoomed)
ax2.set_xlim([-0.01, 0.5])  # Focus on FPR from 0 to 0.5
ax2.set_ylim([0.7, 1.01]) # Focus on TPR from 0.7 to 1.0
ax2.grid(True, linestyle=':', alpha=0.4)
ax2.set_title('ROC Curve (Zoomed)', fontsize=14)
ax2.set_xlabel('False Positive Rate', fontsize=12)
ax2.set_ylabel('True Positive Rate', fontsize=12)
ax2.legend(loc='lower right', fontsize=10)
# Customize tick spacing for zoomed plot
ax2.xaxis.set_major_locator(plt.MultipleLocator(0.05))
ax2.yaxis.set_major_locator(plt.MultipleLocator(0.05))
# Add subtle box in the full range plot matching the zoomed area
rect = plt.Rectangle((-0.01, 0.7), 0.51, 0.31,
fill=False,
linestyle='--',
color='gray',
alpha=0.5)
ax1.add_patch(rect)
plt.tight_layout()
plt.show()
# Plot the ROC curve
plot_roc_curve(models, X_val, y_val)
All our models perform excellently, with the ensemble methods performing best, as the AUC values confirm.
TESTING¶
df_test = pd.read_csv('dataset/test_labeled.csv')
# Remove columns from df_test that are not present in df_train
df_test = df_test[df.columns.intersection(df_test.columns)]
# Preprocess the data
X_test, y_test = df_test.drop(['is_benign', 'category', 'attack'], axis=1), df_test['is_benign']
def plot_model_metrics(models, X, y, name="VALIDATION"):
metrics = ['accuracy', 'precision', 'recall', 'f1']
x = np.arange(len(metrics))
width = 0.15
fig, ax = plt.subplots(figsize=(15, 6))
colors = sns.color_palette("Blues", n_colors=len(models)+2)[2:]
for i, ((model_name, model), color) in enumerate(zip(models.items(), colors)):
y_pred = model.predict(X)
metrics_values = [
accuracy_score(y, y_pred),
precision_score(y, y_pred),
recall_score(y, y_pred),
f1_score(y, y_pred)
]
rects = ax.bar(x + i * width, metrics_values, width, label=model_name, color=color)
for rect in rects:
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2., height/2.,
f'{height:.2f}',
ha='center', va='center',
color='white' if height > 0.3 else 'black',
fontsize=10,
weight='bold')
ax.set_xlabel('Metrics', weight='bold')
ax.set_title(f'Comparison of Classification Metrics ({name})', weight='bold')
ax.set_xticks(x + width * ((len(models)-1)/2))
ax.set_xticklabels(metrics)
ax.set_ylim([0, 1])
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
VALIDATION SET
plot_model_metrics(models, X_val, y_val)
TEST SET
plot_model_metrics(models, X_test, y_test, "TEST")
We can see that the Random Forest, like XGBoost, not only performs better but is also more robust than the Logistic Regression.
4.2 MULTI-CLASS CLASSIFICATION¶
For this section, we used the same models as for the binary classification with the addition of Neural Networks.
df = pd.read_csv('dataset/train_sel_hclust.csv')
Data preprocessing¶
# Drop unnecessary columns
X = df.drop(['is_benign', 'attack', 'category'], axis=1)
y = df['category']
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7, random_state=37)
Defining Evaluation Method¶
As with the binary classification, we defined a function to calculate the chosen metrics for each model and plot the confusion matrix. This time we had to choose an averaging method (via the 'average' parameter), since we now have more than two classes. We chose 'macro' averaging so that each class's per-class metric counts equally. In contrast, 'weighted' averaging would scale each class's contribution by its number of samples, masking a low score on a class with few entries, which we wanted to avoid for our purpose.
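A small numeric example (toy labels, not our data) makes the difference between the two averaging methods concrete: with a rare class that the model misses entirely, 'macro' recall drops sharply while 'weighted' recall barely moves:

```python
from sklearn.metrics import recall_score

# Imbalanced 3-class toy problem: class 2 is rare and never detected
y_true = [0] * 50 + [1] * 45 + [2] * 5
y_pred = [0] * 50 + [1] * 45 + [0] * 5

macro = recall_score(y_true, y_pred, average='macro')        # (1.0 + 1.0 + 0.0) / 3
weighted = recall_score(y_true, y_pred, average='weighted')  # 0.50*1 + 0.45*1 + 0.05*0
```

Macro recall is about 0.67 here, while weighted recall is 0.95 despite the model completely missing one class.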
def evaluate_model(y_true, y_pred, model_name="Model", class_names=None):
"""
Print comprehensive model evaluation metrics with both rich text output and seaborn heatmap.
Parameters:
-----------
y_true : array-like
True labels
y_pred : array-like
Predicted labels
model_name : str, optional
Name of the model for display purposes
class_names : list, optional
List of class names for axis labels
"""
# Calculate core metrics
metrics = {
"Accuracy": accuracy_score(y_true, y_pred),
"Precision": precision_score(y_true, y_pred, average='macro'),
"Recall": recall_score(y_true, y_pred, average='macro'),
"F1 Score": f1_score(y_true, y_pred, average='macro')
}
# Create metrics table
table_data = [[metric, f"{value:.5f}"] for metric, value in metrics.items()]
table = tabulate(table_data, headers=["Metric", "Score"], tablefmt="psql")
print(table)
# Calculate and plot confusion matrix as heatmap
cm = confusion_matrix(y_true, y_pred)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
# Create heatmap with enhanced annotations
plt.figure(figsize=(18, 10))
if class_names is None:
class_names = [f"Class {i}" for i in range(cm.shape[0])]
sns.heatmap(
cm_normalized,
annot=True,
fmt='.2f',
cmap='Blues',
xticklabels=class_names,
yticklabels=class_names,
annot_kws={'size': 16, 'weight': 'bold'} # Increased font size and made bold
)
# Adjust title and label properties
plt.title(f'{model_name} - Normalized Confusion Matrix', fontsize=16, pad=20)
plt.xlabel('Predicted', fontsize=14, weight='bold')
plt.ylabel('True', fontsize=14, weight='bold')
# Adjust tick labels size
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()
return metrics
LOGISTIC REGRESSION¶
As we move to a multi-class classification problem, given the nature of our data, linear models may struggle to capture the complexity of its structures and relationships. The following code implements a Logistic Regression model to verify this assumption.
A high max_iter is necessary in this multiclass setting for the logistic regression to converge to a good optimum.
# Initial model
log_reg = LogisticRegression(C=1.5, max_iter=20000, random_state = 37)
# Scale the data before fitting the model
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
start_time = time.time()
# Fit the model
log_reg.fit(X_train_scaled, y_train);
elapsed_time = time.time() - start_time
print(f"Elapsed time {elapsed_time:.2f} seconds")
Elapsed time 63.54 seconds
y_pred = log_reg.predict(X_val_scaled)
# Evaluate the model
_ = evaluate_model(y_val, y_pred, "Logistic Regression Multiclass", np.unique(y))
+-----------+---------+
| Metric    |   Score |
|-----------+---------|
| Accuracy  | 0.75682 |
| Precision | 0.79062 |
| Recall    | 0.69217 |
| F1 Score  | 0.7058  |
+-----------+---------+
dump(log_reg, 'models/log_reg_multiclass.joblib');
RANDOM FOREST¶
Optimising Hyperparameters¶
This time, we used Bayesian optimisation with 3-fold cross-validation to choose the best parameters for our model.
This method enables a smarter search: instead of evaluating every combination in a parameter grid, it builds a probabilistic model of the performance function and explores the parameter space accordingly. Thus:
- Numeric parameters can be given as ranges
- The search is more efficient: it needs fewer iterations
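As a rough sense of scale, an exhaustive grid over the same ranges used below (treating the integer bounds as inclusive) would require far more fits than the 25 Bayesian iterations:

```python
# Counting an exhaustive grid over the search space defined below
criteria = 3                      # gini, entropy, log_loss
depth_values = 50 - 5 + 1         # max_depth in [5, 50]
estimator_values = 100 - 20 + 1   # n_estimators in [20, 100]

grid_evals = criteria * depth_values * estimator_values
print(grid_evals)                 # 11178 combinations (each refit 3x for 3-fold CV)
# BayesSearchCV instead samples only n_iter=25 points, guided by a surrogate model.
```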
def average_score(y_true, y_pred):
return (accuracy_score(y_true, y_pred) +
precision_score(y_true, y_pred, average='macro') +
recall_score(y_true, y_pred, average='macro') +
f1_score(y_true, y_pred, average='macro')) / 4
average_score = make_scorer(average_score, needs_proba=False)
# Initial model
rnd_forest = RandomForestClassifier()
# Define the parameter space for Bayesian optimization
param_space = {
'criterion': ['gini', 'entropy', 'log_loss'],
'max_depth': (5, 50),
'n_estimators': (20,100),
'random_state': [37]
}
# Set up StratifiedKFold with random_state for reproducibility
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)
# Set up BayesSearchCV
search = BayesSearchCV(
estimator=rnd_forest,
search_spaces=param_space,
n_iter=25, # Number of parameter combinations to evaluate
cv=cv,
scoring=average_score, # Our custom scoring metric
verbose=0,
random_state=37,
n_jobs=-1 # Use all available cores
)
TRAINING
start_time = time.time()
# Fit the Bayesian search
search.fit(X_train, y_train)
elapsed_time = time.time() - start_time
print(f"Elapsed time {elapsed_time:.2f} seconds")
# Access the best model and parameters
best_model = search.best_estimator_
best_params = search.best_params_
# Print the best parameters in a table
param_table = [[param, value] for param, value in best_params.items()]
print("Best Parameters by Average Score:")
print(tabulate(param_table, headers=["Parameter", "Value"], tablefmt="psql"))
Elapsed time 2048.43 seconds
Best Parameters by Average Score:
+--------------+---------+
| Parameter    | Value   |
|--------------+---------|
| criterion    | gini    |
| max_depth    | 50      |
| n_estimators | 100     |
| random_state | 37      |
+--------------+---------+
VALIDATION
y_pred = best_model.predict(X_val)
# Evaluate the model
_ = evaluate_model(y_val, y_pred, "Random Forest Multiclass", np.unique(y))
+-----------+---------+
| Metric    |   Score |
|-----------+---------|
| Accuracy  | 0.9957  |
| Precision | 0.98658 |
| Recall    | 0.97093 |
| F1 Score  | 0.97843 |
+-----------+---------+
dump(best_model, 'models/rnd_forest_multiclass.joblib');
XGBOOST¶
DATA PREPARATION
# Need to encode the target labels
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_val_encoded = le.transform(y_val)
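A quick sketch of how LabelEncoder maps category strings to integers (hypothetical labels; classes are sorted alphabetically), and how encoded predictions can be mapped back with inverse_transform:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = ['DDoS', 'BENIGN', 'MQTT', 'DDoS']
y_enc = le.fit_transform(y)                # classes sorted: BENIGN=0, DDoS=1, MQTT=2
print(list(y_enc))                         # [1, 0, 2, 1]
print(list(le.inverse_transform([2, 0])))  # ['MQTT', 'BENIGN']
```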
SETTING UP THE MODEL
# Initial model
model = xgb.XGBClassifier(
objective='multi:softprob',
num_class=len(np.unique(le.classes_)),
random_state=37,
n_estimators=100,
learning_rate=0.1
)
# Parameter space
param_space = {
'max_depth': (5, 50),
'min_child_weight': (1, 6),
'gamma': (0, 0.5),
'subsample': (0.6, 1.0),
'colsample_bytree': (0.6, 1.0)
}
# Set up StratifiedKFold with random_state for reproducibility
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=37)
# Set up BayesSearchCV
search = BayesSearchCV(
estimator = model,
search_spaces = param_space,
n_iter=37,
cv=cv,
scoring='accuracy', # Custom scorer doesn't work with label encoded targets
n_jobs=-1,
verbose=0,
random_state=37
)
TRAINING
start_time = time.time()
# Fit the Bayesian search
search.fit(X_train, y_train_encoded)
elapsed_time = time.time() - start_time
print(f"Elapsed time {elapsed_time:.2f} seconds")
# Access the best model and parameters
best_model = search.best_estimator_
best_params = search.best_params_
# Print the best parameters in a table
param_table = [[param, value] for param, value in best_params.items()]
print("Best Parameters by Accuracy:")
print(tabulate(param_table, headers=["Parameter", "Value"], tablefmt="psql"))
Elapsed time 2232.46 seconds
Best Parameters by Accuracy:
+------------------+----------+
| Parameter        | Value    |
|------------------+----------|
| colsample_bytree | 0.766697 |
| gamma            | 0.344138 |
| max_depth        | 16       |
| min_child_weight | 2        |
| subsample        | 0.883303 |
+------------------+----------+
VALIDATION
y_pred = best_model.predict(X_val)
# Evaluate the model
_ = evaluate_model(y_val_encoded, y_pred, "XGBoost Multiclass", le.classes_)
+-----------+---------+
| Metric    |   Score |
|-----------+---------|
| Accuracy  | 0.99541 |
| Precision | 0.9833  |
| Recall    | 0.97083 |
| F1 Score  | 0.97686 |
+-----------+---------+
dump(best_model, 'models/xgboost_multiclass.joblib');
Neural Network¶
SMOTE
To address class imbalance and improve model performance:
- Undersampling: reduced large classes to a set upper limit.
- Oversampling with SMOTE: generated synthetic samples for small classes to reach a minimum count.
We apply this specifically for neural networks because they are sensitive to class imbalance, which can lead to biased predictions and poor generalization for minority classes. Balancing ensures the model trains effectively across all categories.
df = pd.read_csv('dataset/train_labeled.csv')
# Get category counts
category_counts = df['category'].value_counts()
# Print category counts table
print(tabulate(
category_counts.reset_index().values,
headers=['Category', 'Count'],
tablefmt='psql'
))
+------------+---------+
| Category   |   Count |
|------------+---------|
| DDoS       |  581986 |
| DoS        |  284552 |
| BENIGN     |  192732 |
| MQTT       |  107607 |
| RECON      |   52541 |
| SPOOFING   |   16047 |
+------------+---------+
# Calculate target counts
min_ratio = 0.5
max_ratio = 1.1
median_count = category_counts.median()
min_samples = int(median_count * min_ratio) # Lower bound
max_samples = int(median_count * max_ratio) # Upper bound
# Initialize final dataframes list
final_dfs = []
# Handle undersampling for large classes
for category in category_counts[category_counts > max_samples].index:
category_df = df[df['category'] == category]
downsampled = resample(category_df,
n_samples=max_samples,
random_state=37)
final_dfs.append(downsampled)
# Handle oversampling for small classes
small_categories = category_counts[category_counts < min_samples].index
if len(small_categories) > 0:
# Prepare data for SMOTE
small_df = df[df['category'].isin(small_categories)]
cat_cols = ['category', 'attack']
cat_data = small_df[cat_cols].copy()
# Apply SMOTE
smote = SMOTE(sampling_strategy={
cat: min(category_counts[cat] * 2, min_samples) for cat in small_categories
}, random_state=37)
X_resampled, y_resampled = smote.fit_resample(
small_df.drop(cat_cols, axis=1),
small_df['category']
)
# Reconstruct DataFrame
augmented_df = pd.DataFrame(X_resampled, columns=df.drop(cat_cols, axis=1).columns)
augmented_df['category'] = y_resampled
augmented_df['attack'] = cat_data['attack'].iloc[0] # Simplified attack labeling
final_dfs.append(augmented_df)
# Keep medium-sized classes as is
medium_mask = (category_counts >= min_samples) & (category_counts <= max_samples)
for category in category_counts[medium_mask].index:
final_dfs.append(df[df['category'] == category])
# Combine all data
balanced_df = pd.concat(final_dfs, ignore_index=True)
balanced_df.to_csv('dataset/train_smote.csv', index=False)
# Get final counts and prepare comparison
final_counts = balanced_df['category'].value_counts()
# Create comparison table
comparison_data = []
for category in sorted(category_counts.index):
comparison_data.append([
category,
category_counts[category],
final_counts.get(category, 0)
])
# Print comparison table
print(tabulate(
comparison_data,
headers=['Category', 'Original', 'After Balance'],
tablefmt='psql'
))
+------------+------------+-----------------+
| Category   |   Original |   After Balance |
|------------+------------+-----------------|
| BENIGN     |     192732 |          165186 |
| DDoS       |     581986 |          165186 |
| DoS       |      284552 |          165186 |
| MQTT       |     107607 |          107607 |
| RECON      |      52541 |           75084 |
| SPOOFING   |      16047 |           32094 |
+------------+------------+-----------------+
In this section we prepare and preprocess the input data for a neural network model whose purpose is to classify traffic, distinguishing benign flows from each attack category.
First, label encoding converts the categorical labels into integers, making the data compatible with the network's classification output. To stabilize the network's learning process and ease convergence, the input features are standardized, restricting their range of values.
PyTorch's TensorDataset and DataLoader then improve performance by batching, shuffling and loading the data during training.
def prepare_data_loaders(df, batch_size=64, test_size=0.3, random_state=37):
"""
Prepare data loaders for training and validation
"""
# Initialize preprocessing objects
label_encoder = LabelEncoder()
scaler = StandardScaler()
# Prepare features and labels
x_train = df.drop(['is_benign', 'category', 'attack'], axis=1).values
y_train = df['category']
# Encode labels as integers
y_train = label_encoder.fit_transform(y_train)
# Split data
x_train_data, x_val_data, y_train_data, y_val_data = train_test_split(
x_train, y_train, test_size=test_size, random_state=random_state
)
# Scale features
x_train_scaled = scaler.fit_transform(x_train_data)
x_val_scaled = scaler.transform(x_val_data)
# Convert to PyTorch tensors
x_train_tensor = torch.tensor(x_train_scaled, dtype=torch.float32)
x_val_tensor = torch.tensor(x_val_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_data, dtype=torch.long)
y_val_tensor = torch.tensor(y_val_data, dtype=torch.long)
# Create datasets and loaders
train_dataset = TensorDataset(x_train_tensor, y_train_tensor)
val_dataset = TensorDataset(x_val_tensor, y_val_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
return {
'train_loader': train_loader,
'val_loader': val_loader,
'val_dataset': val_dataset,
'input_size': x_train.shape[1],
'num_classes': len(label_encoder.classes_),
'class_counts': torch.bincount(y_train_tensor),
'y_train_tensor': y_train_tensor,
'classes': label_encoder.classes_
}
This neural network is designed to handle imbalanced data effectively. The feature extractor captures essential patterns using normalization, dropout, and a non-linear activation. Residual connections address vanishing gradients, allowing deeper learning even on imbalanced datasets. He initialization ensures stable training, making the model resilient and better able to generalize across all classes. Such care is essential for neural networks, which are particularly sensitive to class imbalance and require careful design to prevent bias.
class FeatureExtractor(nn.Module):
def __init__(self, input_size):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(input_size, 512),
nn.BatchNorm1d(512),
nn.LeakyReLU(0.1),
nn.Dropout(0.1)
)
def forward(self, x):
return self.layers(x)
class ResidualBlock(nn.Module):
def __init__(self, size):
super().__init__()
self.block = nn.Sequential(
nn.BatchNorm1d(size),
nn.Linear(size, 2*size),
nn.LeakyReLU(0.1),
nn.Linear(2*size, size),
nn.Dropout(0.1),
)
self.activation = nn.LeakyReLU(0.1)
def forward(self, x):
identity = x
out = self.block(x)
out += identity
return self.activation(out)
class NeuralNetwork(nn.Module):
def __init__(self, input_size, num_classes):
super().__init__()
# Feature extraction path
self.feature_extractor = FeatureExtractor(input_size)
# Main processing path with residual connections
self.main_path = nn.Sequential(
ResidualBlock(512),
ResidualBlock(512),
ResidualBlock(512),
nn.Linear(512, 256),
nn.BatchNorm1d(256),
nn.LeakyReLU(0.2),
nn.Dropout(0.1)
)
self.classifier = nn.Sequential(
nn.Linear(256, num_classes)
)
self._initialize_weights()
def _initialize_weights(self):
for m in self.modules():
if isinstance(m, nn.Linear):
nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='leaky_relu')
if m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.BatchNorm1d):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
def forward(self, x):
# Extract features
features = self.feature_extractor(x)
# Process through main path
main_features = self.main_path(features)
# Classification
output = self.classifier(main_features)
return output
def get_optimizer(model, learning_rate=0.001, weight_decay=1e-5):
"""
Create optimizer for the model
"""
return torch.optim.AdamW(
model.parameters(),
lr=learning_rate,
weight_decay=weight_decay,
betas=(0.9, 0.999)
)
def get_scheduler(optimizer):
"""
Create learning rate scheduler
"""
return torch.optim.lr_scheduler.ReduceLROnPlateau(
optimizer,
mode='min',
factor=0.1,
patience=5
)
# Example of model initialization (to be used in the training loop):
def initialize_model(input_size, num_classes, device):
"""
Initialize the model, optimizer, and scheduler
"""
model = NeuralNetwork(input_size, num_classes).to(device)
optimizer = get_optimizer(model)
scheduler = get_scheduler(optimizer)
return model, optimizer, scheduler
A dictionary named "history" records the main metrics at each epoch during training. The evolution of every metric is then plotted on a line chart with specific colors (blue for Training Loss, orange for Validation Loss, green for Accuracy), with the learning rate on a secondary logarithmic axis (red, dashed).
class TrainingMetrics:
def __init__(self):
self.history = {
'train_loss': [],
'val_loss': [],
'accuracy': [],
'learning_rates': []
}
def update(self, train_loss, val_loss, accuracy, lr):
self.history['train_loss'].append(train_loss)
self.history['val_loss'].append(val_loss)
self.history['accuracy'].append(accuracy)
self.history['learning_rates'].append(lr)
def plot(self, dataset_name):
fig, ax1 = plt.subplots(figsize=(20, 10))
# Plot metrics on primary y-axis
metrics = ['train_loss', 'val_loss', 'accuracy']
colors = ['blue', 'orange', 'green']
for metric, color in zip(metrics, colors):
ax1.plot(self.history[metric],
label=metric.replace('_', ' ').title(),
color=color)
# Create secondary y-axis for learning rate
ax2 = ax1.twinx()
ax2.set_yscale('log') # Set logarithmic scale for learning rate
ax2.plot(self.history['learning_rates'],
label='Learning Rate',
color='red',
linestyle='--')
# Customize plot
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Metrics Value')
ax2.set_ylabel('Learning Rate (log scale)')
# Combine legends from both axes
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, loc='center right')
plt.title(f'Training Metrics - {dataset_name}')
plt.grid(True)
plt.tight_layout()
plt.show()
The following code implements the NN model, designed to classify malicious attack types and benign network flows. A key point of this phase is that the model is trained sequentially on three distinct datasets, whose performances are then compared:
- train_labeled
- train_smote, in which the classes have been rebalanced by undersampling the majority classes and oversampling the minority ones with SMOTE.
- train_sel_hclust
In each epoch of the training phase, the model processes batches of data to compute the cross-entropy loss. The optimizer updates the model's weights via backpropagation, and the scheduler dynamically adjusts the learning rate based on the validation loss. The model is then evaluated on the validation set and the metrics are computed. If the validation accuracy exceeds the best accuracy so far, the model's state is saved as the best model; otherwise, if accuracy does not improve for a number of epochs given by the patience (set to 15, reduced to 9 under certain conditions to improve training efficiency and avoid overfitting), training stops early (early stopping).
After training, the "evaluate_nn" function calculates performance metrics (precision, recall and F1-score) by comparing the predicted labels with the true ones. Results are shown in a bar chart of per-class metrics, alongside a normalized confusion matrix that makes it easy to visualize the types of errors the model makes.
A summary table consolidates the results for easy comparison of best accuracy and validation loss across the three datasets.
def evaluate_nn(model, val_loader, classes, device):
"""Enhanced model evaluation with custom visualizations"""
y_pred = []
y_true = []
probs = []
model.eval()
with torch.no_grad():
for inputs, labels in val_loader:
inputs, labels = inputs.to(device), labels.to(device)
outputs = model(inputs)
_, predicted = torch.max(outputs, 1)
probs.append(torch.softmax(outputs, dim=1).cpu().numpy())
y_pred.extend(predicted.cpu().numpy())
y_true.extend(labels.cpu().numpy())
y_pred = np.array(y_pred)
y_true = np.array(y_true)
probs = np.concatenate(probs)
# Plot metrics
_, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
# Custom Classification Report Visualization
metrics = {
'Precision': precision_score(y_true, y_pred, average=None),
'Recall': recall_score(y_true, y_pred, average=None),
'F1-Score': f1_score(y_true, y_pred, average=None)
}
metrics_df = pd.DataFrame(metrics, index=classes)
metrics_df.plot(kind='bar', ax=ax1, color=sns.color_palette("Blues", n_colors=len(metrics)))
ax1.set_title('Classification Metrics by Class')
ax1.set_xticklabels(classes, rotation=45)
ax1.grid(True, alpha=0.3)
# Enhanced Confusion Matrix
sns.heatmap(confusion_matrix(y_true, y_pred, normalize='true'),
annot=True, fmt='.2f', cmap='Blues',
xticklabels=classes, yticklabels=classes, ax=ax2)
ax2.set_title('Normalized Confusion Matrix')
ax2.set_xlabel('Predicted')
ax2.set_ylabel('True')
plt.tight_layout()
plt.show()
# Return metrics for logging
return {
'accuracy': accuracy_score(y_true, y_pred),
'macro_f1': f1_score(y_true, y_pred, average='macro'),
'metrics_by_class': metrics_df
}
def train_model(model, train_loader, val_loader, val_dataset, criterion, optimizer,
scheduler, epochs, device, dataset_name, save_dir, goat):
print(f"Training on dataset: {dataset_name}")
metrics = TrainingMetrics()
best_val_loss = float('inf')
best_accuracy = 0.0
patience = 9 if goat != 0 else 15
counter = 0
for epoch in range(epochs):
# Training phase
model.train()
epoch_loss = 0
num_batches = len(train_loader)
for batch_X, batch_y in tqdm(train_loader, desc=f"{dataset_name} - Epoch {epoch + 1}/{epochs}", bar_format='{desc}: {elapsed}'):
batch_X, batch_y = batch_X.to(device), batch_y.to(device)
optimizer.zero_grad()
outputs = model(batch_X)
loss = criterion(outputs, batch_y)
loss.backward()
optimizer.step()
epoch_loss += loss.item()
# Validation phase
model.eval()
val_loss = 0
correct = 0
with torch.no_grad():
for batch_X, batch_y in val_loader:
batch_X, batch_y = batch_X.to(device), batch_y.to(device)
outputs = model(batch_X)
val_loss += criterion(outputs, batch_y).item()
_, predicted = torch.max(outputs, 1)
correct += (predicted == batch_y).sum().item()
# Compute metrics
avg_train_loss = epoch_loss / num_batches
avg_val_loss = val_loss / len(val_loader)
accuracy = correct / len(val_dataset)
current_lr = optimizer.param_groups[0]['lr']
metrics.update(avg_train_loss, avg_val_loss, accuracy, current_lr)
# Update best model
if accuracy > best_accuracy:
best_accuracy = accuracy
best_val_loss = avg_val_loss
torch.save({
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'scheduler_state_dict': scheduler.state_dict(),
'best_accuracy': best_accuracy,
'best_val_loss': best_val_loss
}, Path(save_dir) / f"best_model_{dataset_name}.pth")
if best_accuracy > goat:
counter = 0
goat = best_accuracy
patience = 15
else:
counter += 1
scheduler.step(avg_val_loss)
if counter >= patience:
print(f"Early stopping after {patience} epochs without improvement")
break
metrics.plot(dataset_name)
return best_accuracy, best_val_loss, goat
def train_on_multiple_datasets(dataset_paths, save_dir='models', merged=False):
"""
Train the model sequentially on multiple datasets
"""
# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Create save directory
save_dir = Path(save_dir)
# Training configuration
config = {
'batch_size': 64,
'epochs': 30,
'learning_rate': 0.001,
'weight_decay': 1e-5,
}
results = {}
goat = 0.0
# Train on each dataset sequentially
for dataset_path in dataset_paths:
dataset_name = Path(dataset_path).stem
# Load and prepare data
df = pd.read_csv(dataset_path, low_memory=False)
if merged:
df['category'] = df['category'].replace({
'DDoS': 'DOS_DDOS',
'DoS': 'DOS_DDOS'
})
dataset_name += '_merged'
data = prepare_data_loaders(df, batch_size=config['batch_size'])
class_counts = data['class_counts']
# Initialize model, optimizer, and criterion
model = NeuralNetwork(data['input_size'], data['num_classes']).to(device)
optimizer = get_optimizer(model, config['learning_rate'], config['weight_decay'])
scheduler = get_scheduler(optimizer)
# Calculate class weights for balanced training
total_samples = len(data['y_train_tensor'])
class_weights = total_samples / (len(data['class_counts']) * data['class_counts'])
criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))
best_accuracy, best_val_loss, goat = train_model(
model=model,
train_loader=data['train_loader'],
val_loader=data['val_loader'],
val_dataset=data['val_dataset'],
criterion=criterion,
optimizer=optimizer,
scheduler=scheduler,
epochs=config['epochs'],
device=device,
dataset_name=dataset_name,
save_dir=save_dir,
goat=goat
)
# Check if the best model file exists
best_model_path = Path(save_dir) / f"best_model_{dataset_name}.pth"
if not best_model_path.exists():
print(f"Best model file for {dataset_name} not found. Skipping evaluation.")
continue
# Load best model for evaluation
checkpoint = torch.load(best_model_path, weights_only=True);
model.load_state_dict(checkpoint['model_state_dict'])
# Evaluate model
# print(f"\nEvaluating model for dataset: {dataset_name}")
evaluate_nn(model, data['val_loader'], data['classes'], device)
# Store results
results[dataset_name] = {
'best_accuracy': best_accuracy,
'best_val_loss': best_val_loss
}
# Print final results summary in a table using tabulate
results_df = pd.DataFrame(results).T
results_df.columns = ['Best Accuracy', 'Best Validation Loss']
print("\nTraining Results Summary:")
print(tabulate(results_df, headers='keys', tablefmt='psql'))
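The balanced class-weight formula used inside train_on_multiple_datasets, w_c = N / (K * n_c), can be checked on toy counts (hypothetical numbers, not the project dataset):

```python
# w_c = total_samples / (num_classes * count_c): rarer classes weigh more in the loss.
class_counts = [400, 200, 100]   # hypothetical per-class sample counts
total = sum(class_counts)
weights = [total / (len(class_counts) * n) for n in class_counts]
print([round(w, 4) for w in weights])  # [0.5833, 1.1667, 2.3333]
```

The class with a quarter of the majority class's samples receives four times its weight, which is exactly the rebalancing CrossEntropyLoss needs.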
dataset_paths = [
"dataset/train_labeled.csv",
"dataset/train_smote.csv",
"dataset/train_sel_hclust.csv",
]
train_on_multiple_datasets(dataset_paths)
Using device: cuda
Training on dataset: train_labeled
train_labeled - Epoch 1/30: 00:57 train_labeled - Epoch 2/30: 00:57 train_labeled - Epoch 3/30: 00:57 train_labeled - Epoch 4/30: 00:56 train_labeled - Epoch 5/30: 00:58 train_labeled - Epoch 6/30: 00:58 train_labeled - Epoch 7/30: 00:57 train_labeled - Epoch 8/30: 00:57 train_labeled - Epoch 9/30: 00:58 train_labeled - Epoch 10/30: 00:57 train_labeled - Epoch 11/30: 00:57 train_labeled - Epoch 12/30: 00:57 train_labeled - Epoch 13/30: 00:57 train_labeled - Epoch 14/30: 00:58 train_labeled - Epoch 15/30: 00:57 train_labeled - Epoch 16/30: 00:57 train_labeled - Epoch 17/30: 00:57 train_labeled - Epoch 18/30: 00:57 train_labeled - Epoch 19/30: 00:57 train_labeled - Epoch 20/30: 00:57 train_labeled - Epoch 21/30: 00:57 train_labeled - Epoch 22/30: 00:57 train_labeled - Epoch 23/30: 00:57 train_labeled - Epoch 24/30: 00:57 train_labeled - Epoch 25/30: 00:57 train_labeled - Epoch 26/30: 00:57 train_labeled - Epoch 27/30: 00:57 train_labeled - Epoch 28/30: 00:57
Early stopping after 15 epochs without improvement
Training on dataset: train_smote
train_smote - Epoch 1/30: 00:33 train_smote - Epoch 2/30: 00:33 train_smote - Epoch 3/30: 00:33 train_smote - Epoch 4/30: 00:33 train_smote - Epoch 5/30: 00:33 train_smote - Epoch 6/30: 00:33 train_smote - Epoch 7/30: 00:33 train_smote - Epoch 8/30: 00:33 train_smote - Epoch 9/30: 00:33 train_smote - Epoch 10/30: 00:33 train_smote - Epoch 11/30: 00:33 train_smote - Epoch 12/30: 00:33 train_smote - Epoch 13/30: 00:33 train_smote - Epoch 14/30: 00:33 train_smote - Epoch 15/30: 00:33 train_smote - Epoch 16/30: 00:33 train_smote - Epoch 17/30: 00:33 train_smote - Epoch 18/30: 00:33 train_smote - Epoch 19/30: 00:33 train_smote - Epoch 20/30: 00:33 train_smote - Epoch 21/30: 00:33 train_smote - Epoch 22/30: 00:33 train_smote - Epoch 23/30: 00:33 train_smote - Epoch 24/30: 00:33 train_smote - Epoch 25/30: 00:33 train_smote - Epoch 26/30: 00:33 train_smote - Epoch 27/30: 00:33 train_smote - Epoch 28/30: 00:33
Early stopping after 15 epochs without improvement
Training on dataset: train_sel_hclust
train_sel_hclust - Epoch 1/30: 00:58 train_sel_hclust - Epoch 2/30: 00:57 train_sel_hclust - Epoch 3/30: 00:57 train_sel_hclust - Epoch 4/30: 00:58 train_sel_hclust - Epoch 5/30: 00:57 train_sel_hclust - Epoch 6/30: 00:58 train_sel_hclust - Epoch 7/30: 00:57 train_sel_hclust - Epoch 8/30: 00:57 train_sel_hclust - Epoch 9/30: 00:57
Early stopping after 9 epochs without improvement
Training Results Summary:
+------------------+-----------------+------------------------+
|                  |   Best Accuracy |   Best Validation Loss |
|------------------+-----------------+------------------------|
| train_labeled    |        0.772842 |               0.35129  |
| train_smote      |        0.803503 |               0.324712 |
| train_sel_hclust |        0.70862  |               0.382404 |
+------------------+-----------------+------------------------+
Given these results, it is clear how difficult it is for the model to distinguish between DoS and DDoS attacks. To address this issue, we tried merging the two categories (also because, at their root, these attacks are similar).
train_on_multiple_datasets(dataset_paths, merged = True)
Using device: cuda
Training on dataset: train_labeled
train_labeled - Epoch 1/30: 00:58 train_labeled - Epoch 2/30: 00:57 train_labeled - Epoch 3/30: 00:57 train_labeled - Epoch 4/30: 00:57 train_labeled - Epoch 5/30: 00:57 train_labeled - Epoch 6/30: 00:57 train_labeled - Epoch 7/30: 00:58 train_labeled - Epoch 8/30: 00:58 train_labeled - Epoch 9/30: 00:57 train_labeled - Epoch 10/30: 00:57 train_labeled - Epoch 11/30: 00:59 train_labeled - Epoch 12/30: 00:58 train_labeled - Epoch 13/30: 00:57 train_labeled - Epoch 14/30: 00:57 train_labeled - Epoch 15/30: 00:57 train_labeled - Epoch 16/30: 00:57 train_labeled - Epoch 17/30: 00:57 train_labeled - Epoch 18/30: 00:57 train_labeled - Epoch 19/30: 00:57 train_labeled - Epoch 20/30: 00:57 train_labeled - Epoch 21/30: 00:57 train_labeled - Epoch 22/30: 00:57 train_labeled - Epoch 23/30: 00:57 train_labeled - Epoch 24/30: 00:57 train_labeled - Epoch 25/30: 00:57 train_labeled - Epoch 26/30: 00:57 train_labeled - Epoch 27/30: 00:57 train_labeled - Epoch 28/30: 00:57 train_labeled - Epoch 29/30: 00:57 train_labeled - Epoch 30/30: 00:57
Training on dataset: train_smote
train_smote - Epoch 1/30: 00:33 train_smote - Epoch 2/30: 00:33 train_smote - Epoch 3/30: 00:33 train_smote - Epoch 4/30: 00:33 train_smote - Epoch 5/30: 00:33 train_smote - Epoch 6/30: 00:33 train_smote - Epoch 7/30: 00:33 train_smote - Epoch 8/30: 00:33 train_smote - Epoch 9/30: 00:33
Early stopping after 9 epochs without improvement
Training on dataset: train_sel_hclust
train_sel_hclust - Epoch 1/30: 00:58 train_sel_hclust - Epoch 2/30: 00:58 train_sel_hclust - Epoch 3/30: 00:57 train_sel_hclust - Epoch 4/30: 00:57 train_sel_hclust - Epoch 5/30: 00:58 train_sel_hclust - Epoch 6/30: 00:57 train_sel_hclust - Epoch 7/30: 00:57 train_sel_hclust - Epoch 8/30: 00:57 train_sel_hclust - Epoch 9/30: 00:57
Early stopping after 9 epochs without improvement
Training Results Summary:
+------------------+-----------------+------------------------+
|                  |   Best Accuracy |   Best Validation Loss |
|------------------+-----------------+------------------------|
| train_labeled    |        0.97643  |               0.150561 |
| train_smote      |        0.957335 |               0.158727 |
| train_sel_hclust |        0.972842 |               0.217276 |
+------------------+-----------------+------------------------+
TESTING¶
def plot_model_metrics(results, name="VALIDATION"):
metrics = ['accuracy', 'precision', 'recall', 'f1']
model_names = ['Logistic Regression', 'Random Forest', 'XGBoost',
'Neural Network (6 Classes)', 'Neural Network (5 Classes)']
fig, ax = plt.subplots(figsize=(15, 6))
width = 0.15
x = np.arange(len(metrics))
# Create a palette and keep only the last 5 colors
colors = sns.color_palette("Blues", n_colors=len(model_names) + 2)[2:]
for i, (model, color) in enumerate(zip(model_names, colors)):
values = [results[f"{model}_{metric}"][0] for metric in metrics]
rects = ax.bar(x + i * width, values, width, label=model, color=color)
for rect in rects:
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2., height/2.,
f'{height:.2f}',
ha='center', va='center',
color='white' if height > 0.3 else 'black',
fontsize=10,
weight='bold')
ax.set_xlabel('Metrics', weight='bold')
ax.set_title(f'Comparison of Classification Metrics ({name})', weight='bold')
ax.set_xticks(x + width * 2)
ax.set_xticklabels(metrics)
ax.set_ylim([0, 1])
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
LOADING MODELS
The NNWrapper class is used to wrap the neural networks, providing a unified predict method for compatibility with the training and evaluation pipeline. This is necessary because neural networks in PyTorch do not inherently have a predict function, unlike scikit-learn models. The wrapper converts input data into tensors and ensures predictions are made efficiently in evaluation mode.
class NNWrapper:
def __init__(self, model):
self.model = model
def predict(self, X):
self.model.eval()
with torch.no_grad():
X_tensor = torch.tensor(X.values, dtype=torch.float32)
logits = self.model(X_tensor)
return torch.argmax(logits, dim=1).numpy()
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
log_reg = load('models/log_reg_multiclass.joblib')
rnd_frs = load('models/rnd_forest_multiclass.joblib')
xgboost = load('models/xgboost_multiclass.joblib')
# Rebuild the 6-class network, inferring the layer sizes from the checkpoint
loaded = torch.load('models/best_model_train_smote.pth', map_location=torch.device('cpu'))
input_size = loaded['model_state_dict']['feature_extractor.layers.0.weight'].shape[1]
num_classes = loaded['model_state_dict']['classifier.0.weight'].shape[0]
nn_6cla = NeuralNetwork(input_size, num_classes)
nn_6cla.load_state_dict(loaded['model_state_dict'])
nn_6cla.eval()
# Rebuild the 5-class network in the same way
loaded = torch.load('models/best_model_train_labeled_merged.pth', map_location=torch.device('cpu'))
input_size = loaded['model_state_dict']['feature_extractor.layers.0.weight'].shape[1]
num_classes = loaded['model_state_dict']['classifier.0.weight'].shape[0]
nn_5cla = NeuralNetwork(input_size, num_classes)
nn_5cla.load_state_dict(loaded['model_state_dict'])
nn_5cla.eval()
models = {
'Logistic Regression': log_reg,
'Random Forest': rnd_frs,
'XGBoost': xgboost,
'Neural Network (6 Classes)': NNWrapper(nn_6cla),
'Neural Network (5 Classes)': NNWrapper(nn_5cla)
}
VALIDATION SET
Each model is linked to specific dataset requirements, such as scaling or encoding, to match its training conditions. The code below evaluates all models on the validation sets, computing accuracy, precision, recall, and F1 score for a consistent performance comparison.
def process_train_dataset(df_path, needs_scaling=False, needs_encoding=False):
df = pd.read_csv(df_path)
if 'label' in df_path:
df['category'] = df['category'].replace({
'DDoS': 'DOS_DDOS',
'DoS': 'DOS_DDOS'
})
X = df.drop(['is_benign', 'category', 'attack'], axis=1)
y = df['category']
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7, random_state=37)
scaler = None
if needs_scaling:
scaler = StandardScaler()
scaler.fit(X_train)
X_val = pd.DataFrame(scaler.transform(X_val), columns=X_val.columns)
if needs_encoding:
le = LabelEncoder()
y_val = le.fit_transform(y_val)
del df, X, y, X_train, y_train
return X_val, y_val
import warnings
warnings.filterwarnings('ignore', category=UserWarning)
train_requirements = {
'Logistic Regression': ('dataset/train_sel_hclust.csv', True, False),
'Random Forest': ('dataset/train_sel_hclust.csv', False, False),
'XGBoost': ('dataset/train_sel_hclust.csv', False, True),
'Neural Network (6 Classes)': ('dataset/train_smote.csv', True, True),
'Neural Network (5 Classes)': ('dataset/train_labeled.csv', True, True)
}
train_results = defaultdict(list)
for model_name, model in models.items():
path, needs_scaling, needs_encoding = train_requirements[model_name]
result = process_train_dataset(path, needs_scaling, needs_encoding)
X_val, y_val = result
y_pred = model.predict(X_val)
metrics = {
'accuracy': accuracy_score(y_val, y_pred),
'precision': precision_score(y_val, y_pred, average='macro'),
'recall': recall_score(y_val, y_pred, average='macro'),
'f1': f1_score(y_val, y_pred, average='macro')
}
for metric, value in metrics.items():
train_results[f"{model_name}_{metric}"].append(value)
del X_val, y_val
plot_model_metrics(train_results)
TEST SET
The function processes the test data in line with each model's training conditions: it fits a StandardScaler on the training data before transforming the test features, merges the DDoS and DoS labels into DOS_DDOS where required, and applies label encoding when needed, as before.
def process_test_data(test_path, train_path, needs_scaling=False, needs_encoding=False):
df_test = pd.read_csv(test_path)
if 'hclust' in train_path or needs_encoding:
df_train = pd.read_csv(train_path)
if 'hclust' in train_path:
df_test = df_test[df_train.columns.intersection(df_test.columns)]
X_test = df_test.drop(['is_benign', 'category', 'attack'], axis=1)
y_test = df_test['category']
# Handle label transformations for labeled datasets
if 'label' in train_path:
y_test = y_test.replace({
'DDoS': 'DOS_DDOS',
'DoS': 'DOS_DDOS'
})
if needs_encoding:
le = LabelEncoder()
y_test = le.fit_transform(y_test)
if needs_scaling:
X_train, _, _, _ = train_test_split(df_train.drop(['is_benign', 'category', 'attack'], axis=1), df_train['category'], train_size=0.7, random_state=37)
scaler = StandardScaler()
scaler.fit(X_train)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
return X_test, y_test
import warnings
warnings.filterwarnings('ignore', category=UserWarning)
test_requirements = {
'Logistic Regression': ('dataset/test_labeled.csv', 'dataset/train_sel_hclust.csv', True, False),
'Random Forest': ('dataset/test_labeled.csv', 'dataset/train_sel_hclust.csv', False, False),
'XGBoost': ('dataset/test_labeled.csv', 'dataset/train_sel_hclust.csv', False, True),
'Neural Network (6 Classes)': ('dataset/test_labeled.csv', 'dataset/train_smote.csv', True, True),
'Neural Network (5 Classes)': ('dataset/test_labeled.csv', 'dataset/train_labeled.csv', True, True)
}
test_results = defaultdict(list)
for model_name, model in models.items():
test_path, train_path, needs_scaling, needs_encoding = test_requirements[model_name]
X_test, y_test = process_test_data(test_path, train_path, needs_scaling, needs_encoding)
y_pred = model.predict(X_test)
metrics = {
'accuracy': accuracy_score(y_test, y_pred),
'precision': precision_score(y_test, y_pred, average='macro', zero_division=0),
'recall': recall_score(y_test, y_pred, average='macro', zero_division=0),
'f1': f1_score(y_test, y_pred, average='macro', zero_division=0)
}
for metric, value in metrics.items():
test_results[f"{model_name}_{metric}"].append(value)
plot_model_metrics(test_results, "TEST")
Comparing with the results obtained in binary classification, logistic regression worsened in the multiclass setting, while our decision-tree-based ensembles maintained very good performance with only a slight decrease, confirming themselves as the strongest models.
The 6-class NN performed poorly, considering that it did worse than logistic regression, a far simpler model; the 5-class NN scores, on the other hand, were rather good. It is worth noting that tabular data like ours is not the kind of data on which neural networks typically excel.
5. UNSUPERVISED LEARNING¶
5.1 CLUSTERING¶
df = pd.read_csv('dataset/train_sel_hclust.csv')
Data Preprocessing Functions¶
This code implements preprocessing functions to rebalance and rescale the data, both of which are beneficial before running clustering algorithms.
def balance_dataset(df, method='median_multiplier', multiplier=2, column='category'):
"""
Balance the dataset by downsampling majority classes.
Parameters:
- df: pandas DataFrame with 'category' or 'attack' column
- method: str, approach for calculating threshold
- multiplier: float, multiplier for threshold calculation
Returns:
- balanced DataFrame
"""
counts = df[column].value_counts()
# Calculate threshold
if method == 'median_multiplier':
threshold = np.median(counts) * multiplier
elif method == 'mean_multiplier':
threshold = np.mean(counts) * multiplier
elif method == 'quantile':
threshold = counts.quantile(0.75)
else:
raise ValueError("Method must be 'median_multiplier', 'mean_multiplier', or 'quantile'")
# Balance categories
balanced_dfs = []
for value in counts.index:
value_df = df[df[column] == value]
if len(value_df) > threshold:
value_df = resample(value_df, replace=False, n_samples=int(threshold), random_state=42)
balanced_dfs.append(value_df)
balanced_df = pd.concat(balanced_dfs)
comparison_data = [
[value, counts[value], balanced_df[column].value_counts().get(value, 0)]
for value in sorted(counts.index)
]
comparison_data.sort(key=lambda x: x[1], reverse=True)
print(tabulate(
comparison_data,
headers=[column, 'Original', 'After Balance'],
tablefmt='psql'
))
return balanced_df.sample(frac=0.1, random_state=37)
def preprocess_data_cluster(df, **kwargs):
"""Scale and balance data."""
method = kwargs.get('method', 'median_multiplier')
multiplier = kwargs.get('multiplier', 2)
column = kwargs.get('column', 'category')
df = balance_dataset(df, method=method, multiplier=multiplier, column=column)
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df.drop(['is_benign', 'category', 'attack'], axis=1))
return scaled_df, df[column]
Clustering Functions¶
In this section we run K-Means and DBSCAN clustering, evaluating the results with three metrics: the Silhouette, Calinski-Harabasz, and Davies-Bouldin scores.
For K-Means, in line with what was explained previously for the NN, we fit the model with both 5 and 6 clusters, due to the similarities between DoS and DDoS attacks.
For DBSCAN, the clustering is performed over a grid of eps values (the maximum distance between two samples for one to be considered in the neighborhood of the other) and min_samples values (the minimum number of samples required to form a dense region); the best result according to the silhouette score is then kept.
def kmeans_clustering(data, n_clusters):
"""Perform K-Means clustering and evaluate results."""
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=37)
labels = kmeans.fit_predict(data)
print(f"K-Means Clustering with {n_clusters} clusters completed.")
return labels, kmeans
# DBSCAN Clustering
def dbscan_clustering(data, true_labels, eps_values=[0.1, 0.3, 0.5, 0.7, 0.9]):
    """Perform DBSCAN clustering with different parameters and keep the best result."""
    best_score = -1
    best_labels = None
    best_eps = None
    best_min_samples = None
    # Determine the least frequent category
    label_counts = pd.Series(true_labels).value_counts()
    least_frequent_count = label_counts.min()
    # Use some divisors of the least frequent count as min_samples values
    min_samples_values = [max(1, least_frequent_count // i) for i in range(1, 6)]
    for eps in eps_values:
        for min_samples in min_samples_values:
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            cluster_labels = dbscan.fit_predict(data)
            # silhouette_score requires at least 2 distinct labels
            if len(set(cluster_labels)) < 2:
                continue
            score = silhouette_score(data, cluster_labels)
            if score > best_score:
                best_score = score
                best_labels = cluster_labels
                best_eps = eps
                best_min_samples = min_samples
    print(f"Best DBSCAN Clustering with eps={best_eps}, min_samples={best_min_samples} completed.")
    return best_labels
# Evaluate Clustering
def evaluate_clustering(data, labels):
"""Evaluate clustering using different metrics."""
silhouette = silhouette_score(data, labels)
calinski = calinski_harabasz_score(data, labels)
davies = davies_bouldin_score(data, labels)
return silhouette, calinski, davies
In the visualization phase, we chose t-SNE for dimensionality reduction because the resulting plots are clearer and more readable than those produced by other techniques, such as PCA.
def visualize_clusters(data, cluster_labels_list, true_labels, method_names):
"""
Visualize clustering results alongside true labels using t-SNE.
Enhanced for better readability and comparison.
"""
tsne = TSNE(n_components=2, random_state=37)
reduced_data = tsne.fit_transform(data)
n_methods = len(cluster_labels_list)
_, axes = plt.subplots(n_methods, 2, figsize=(20, 8*n_methods))
# If only one method, wrap axes in 2D array for consistent indexing
if n_methods == 1:
axes = axes.reshape(1, 2)
for idx, (method_labels, method_name) in enumerate(zip(cluster_labels_list, method_names)):
# Plot clustering results
cluster_scatter = sns.scatterplot(
x=reduced_data[:, 0],
y=reduced_data[:, 1],
hue=method_labels,
palette='Set3',
alpha=0.7,
s=100, # Larger point size
ax=axes[idx, 0]
)
axes[idx, 0].set_title(f'{method_name}\nClustering Results', fontsize=12, pad=15)
cluster_scatter.legend(title="Clusters", bbox_to_anchor=(1.02, 1), loc='upper left', title_fontsize=10)
# Plot true labels
true_scatter = sns.scatterplot(
x=reduced_data[:, 0],
y=reduced_data[:, 1],
hue=true_labels,
palette='Spectral',
alpha=0.7,
s=100, # Larger point size
ax=axes[idx, 1]
)
axes[idx, 1].set_title('True Labels\nGround Truth', fontsize=12, pad=15)
true_scatter.legend(title="Categories", bbox_to_anchor=(1.02, 1), loc='upper left', title_fontsize=10)
# Remove axis labels as they're not meaningful in t-SNE space
axes[idx, 0].set_xlabel('')
axes[idx, 0].set_ylabel('')
axes[idx, 1].set_xlabel('')
axes[idx, 1].set_ylabel('')
# Add grid for better readability
axes[idx, 0].grid(True, alpha=0.3)
axes[idx, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
scaled_data, labels = preprocess_data_cluster(df)
+------------+------------+-----------------+
| category   |   Original |   After Balance |
|------------+------------+-----------------|
| DDoS       |     581986 |          300339 |
| DoS        |     284552 |          284552 |
| BENIGN     |     192732 |          192732 |
| MQTT       |     107607 |          107607 |
| RECON      |      52541 |           52541 |
| SPOOFING   |      16047 |           16047 |
+------------+------------+-----------------+
# Perform and evaluate K-Means with 6 clusters
kmeans_6_labels, kmeans_6 = kmeans_clustering(scaled_data, n_clusters=6)
kmeans_6_scores = evaluate_clustering(scaled_data, kmeans_6_labels)
K-Means Clustering with 6 clusters completed.
# Perform and evaluate K-Means with 5 clusters
kmeans_5_labels, kmeans_5 = kmeans_clustering(scaled_data, n_clusters=5)
kmeans_5_scores = evaluate_clustering(scaled_data, kmeans_5_labels)
K-Means Clustering with 5 clusters completed.
# Perform and evaluate DBSCAN
dbscan_labels = dbscan_clustering(scaled_data, labels)
dbscan_scores = evaluate_clustering(scaled_data, dbscan_labels)
Best DBSCAN Clustering with eps=0.9, min_samples=404 completed.
Results for each method are shown in a summary table reporting the selected metrics for comparison. Likewise, the plots below compare side by side the clustering results obtained by the three methods. As can be observed, although the clustering methods succeed in separating the data into clusters, they are not effective in identifying the classes needed for detecting malicious events.
# Display Evaluation Results in a Table
evaluation_table = [
["K-Means-6", *kmeans_6_scores],
["K-Means-5", *kmeans_5_scores],
["DBSCAN", *dbscan_scores]
]
headers = ["Method", "Silhouette Score", "Calinski-Harabasz Score", "Davies-Bouldin Score"]
print("\nClustering Evaluation Summary:")
print(tabulate(evaluation_table, headers=headers, tablefmt="grid"))
# Visualize Clustering Results Side by Side
visualize_clusters(
scaled_data,
[kmeans_6_labels, kmeans_5_labels, dbscan_labels],
labels,
    ["K-Means/6", "K-Means/5", f"DBSCAN/{len(set(dbscan_labels))}"]  # DBSCAN: number of distinct labels found
)
Clustering Evaluation Summary:
+-----------+--------------------+---------------------------+------------------------+
| Method    |   Silhouette Score |   Calinski-Harabasz Score |   Davies-Bouldin Score |
+===========+====================+===========================+========================+
| K-Means-6 |           0.402097 |                   9673.79 |                1.33808 |
+-----------+--------------------+---------------------------+------------------------+
| K-Means-5 |           0.38851  |                   9900.97 |                1.50973 |
+-----------+--------------------+---------------------------+------------------------+
| DBSCAN    |           0.479441 |                    3776.1 |                1.78247 |
+-----------+--------------------+---------------------------+------------------------+
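The internal metrics above only measure cluster geometry, not agreement with the attack categories. One way such agreement could be quantified (a sketch added here for illustration, not part of the original analysis) is scikit-learn's Adjusted Rand Index, which is 1.0 for a perfect match between a partition and the ground-truth labels and around (or below) zero for chance-level assignments:

```python
from sklearn.metrics import adjusted_rand_score

# Toy example with hypothetical labels: cluster IDs vs. ground-truth categories
true_labels = ['BENIGN', 'BENIGN', 'DDoS', 'DDoS', 'MQTT', 'MQTT']
perfect = [0, 0, 1, 1, 2, 2]    # each cluster maps to exactly one category
mixed   = [0, 1, 2, 0, 1, 2]    # every cluster mixes all categories

print(adjusted_rand_score(true_labels, perfect))  # 1.0
print(adjusted_rand_score(true_labels, mixed))    # below zero: worse than chance
```

In our setting, `adjusted_rand_score(labels, kmeans_6_labels)` would give a single number for how far the K-Means partition is from the true categories.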
5.2 Anomaly Detection¶
We created a new dataset for anomaly detection, aiming to recreate a realistic scenario in which most of the traffic is benign and only a small portion is malicious. For the malicious traffic we selected an equal number of samples from each attack category.
# Load the dataset
df = pd.read_csv('dataset/train_sel_hclust.csv')
# Select benign data
benign_data = df[df['category'] == 'BENIGN'].sample(frac=0.1, random_state=37)
# Define the number of samples per attack type
sample_size = 100
# Loop through each attack type and sample data
attack_data = pd.DataFrame()
attack_types = [col for col in df['category'].unique() if col != 'BENIGN']
for attack_type in attack_types:
attack_subset = df[df['category'] == attack_type]
sampled_data = attack_subset.sample(sample_size, random_state=37)
attack_data = pd.concat([attack_data, sampled_data])
# Combine benign and attack samples
final_data = pd.concat([benign_data, attack_data])
# Save or use the new dataset
final_data.to_csv('dataset/train_anom_detect.csv', index=False)
df = pd.read_csv('dataset/train_anom_detect.csv')
sns.countplot(data=df, x='category');
Univariate Anomaly Detection¶
As a first method, we looked at the outliers of each feature separately, using boxplots. The whiskers of a boxplot extend 1.5 times the interquartile range above/below the top/bottom of the box; specifically, each whisker is drawn to the largest/smallest observed point in the dataset that falls within this distance. Points outside that range are considered outliers and are shown on the plot as circles.
cols = df.drop(columns=['category','attack','is_benign']).columns
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
# Note: df_minmax is an alias of df, so the min-max scaling below is applied
# to df itself; the anomaly-detection steps later operate on scaled features.
df_minmax = df
df_minmax[cols] = min_max_scaler.fit_transform(df[cols])
ax = df_minmax[cols].boxplot(figsize=(12, 8))
plt.setp(ax.get_xticklabels(), rotation=90)
ax.grid(False)
plt.show()
We now color each outlier by the 'is_benign' boolean feature (green = benign traffic; red = malicious):
# Create the boxplot without showing the fliers (outliers)
ax = df_minmax[cols].boxplot(figsize=(12, 8), showfliers=False)
ax.grid(False)
plt.setp(ax.get_xticklabels(), rotation=90)
# Overlay the outliers with colors based on 'is_benign'
for i, col in enumerate(cols, start=1): # start=1 to match boxplot position indexing
data = df_minmax[col]
q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
# Identify outliers
outliers = data[(data < lower_bound) | (data > upper_bound)]
benign_status = df_minmax.loc[outliers.index, 'is_benign']
# Plot the outliers with color based on 'is_benign'
plt.scatter([i] * len(outliers), outliers,
c=np.where(benign_status, 'green', 'red'), # Green if benign, red otherwise
label='Outliers', alpha=0.3, edgecolor='k')
plt.show()
What we can conclude:
- The particular distribution of the "IAT" variable makes the box plot unsuitable for it.
- In general, the outliers do not match the benign/malicious split, apart from a few features.
Let's explore those features:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Bar plot for 'rst_flag_number'
sns.countplot(
data=df,
x='rst_flag_number',
hue='is_benign',
palette={1: 'green', 0: 'red'},
ax=axes[0]
)
axes[0].set_xlabel('RST Flag Number')
axes[0].set_ylabel('Count')
axes[0].set_title('Count of RST Flag Number by Is Benign')
axes[0].set_yscale('log') # Apply logarithmic scale to y-axis
# Show only a subset of x labels for 'rst_flag_number'
axes[0].set_xticks([0])
axes[0].set_xticklabels([0])
# Bar plot for 'fin_count'
sns.countplot(
data=df,
x='fin_count',
hue='is_benign',
palette={1: 'green', 0: 'red'},
ax=axes[1]
)
axes[1].set_xlabel('FIN Count')
axes[1].set_ylabel('Count')
axes[1].set_title('Count of FIN Count by Is Benign')
axes[1].set_yscale('log') # Apply logarithmic scale to y-axis
# Show only a subset of x labels for 'fin_count'
axes[1].set_xticks([0])
axes[1].set_xticklabels([0])
# Adjust layout for better appearance
plt.tight_layout()
plt.show()
plt.figure(figsize=(10, 6))
palette = {1: 'green', 0: 'red'}
sns.scatterplot(
data=df,
x='rst_flag_number',
y='fin_count',
hue='is_benign',
palette=palette,
alpha=0.5
)
# Add labels and a title
plt.xlabel('RST Flag Number')
plt.ylabel('FIN Count')
plt.title('Scatter Plot of RST Flag Number vs FIN Count')
# Show the legend and the plot
plt.legend(title='Is Benign')
plt.show()
Multivariate Anomaly Detection¶
Let's try now a ML based approach.
Let's define the main parameter for these anomaly detection models, 'outlier fraction':
outlier_fraction = len(df[df['is_benign']==0])/len(df[df['is_benign']==1])
print(outlier_fraction)
0.02594302910807866
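A note on this value: scikit-learn's `contamination` parameter expects the proportion of outliers in the whole dataset, whereas the ratio above divides the malicious count by the benign count. For small fractions the two are nearly identical, as this sketch shows (the counts are our inference from the sampling performed above: 5 attack types x 100 samples each, and a 10% sample of the 192,732 benign rows):

```python
# Counts implied by the sampling above (assumed for illustration)
n_malicious, n_benign = 500, 19_273

ratio = n_malicious / n_benign                      # what the notebook computes
fraction = n_malicious / (n_malicious + n_benign)   # true outlier proportion

print(f"ratio    = {ratio:.6f}")    # ~0.025943
print(f"fraction = {fraction:.6f}") # ~0.025287
# The two differ by a factor of (1 - fraction), negligible at this scale.
```

Either value is therefore a reasonable setting for `contamination` here.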
Elliptic Envelope¶
import warnings
warnings.filterwarnings('ignore', category=UserWarning)
y_pred = EllipticEnvelope(contamination=outlier_fraction, random_state=37).fit_predict(df[cols])
sns.countplot(data=df[y_pred == -1], x='category');
Other Methods¶
# Define models to compare
models = {
"IsolationForest": IsolationForest(contamination=outlier_fraction, random_state=37),
"LocalOutlierFactor": LocalOutlierFactor(contamination=outlier_fraction, novelty=True),
"OneClassSVM": OneClassSVM(nu=outlier_fraction),
}
# Prepare the dataset
results = []
# Apply each model
for model_name, model in models.items():
    # Fit and predict anomalies
    if hasattr(model, "fit_predict"):  # IsolationForest and OneClassSVM
        y_pred = model.fit_predict(df[cols])
    else:  # LocalOutlierFactor with novelty=True needs separate fit and predict
        model.fit(df[cols].values)
        y_pred = model.predict(df[cols].values)
# Convert predictions to DataFrame
df_results = df.copy()
df_results["anomaly"] = y_pred
df_results["model"] = model_name
# Append results
results.append(df_results)
# Combine results into a single DataFrame
results_df = pd.concat(results)
# Filter anomalies
anomalies = results_df[results_df["anomaly"] == -1]
# Plot anomalies grouped by model
plt.figure(figsize=(12, 6))
sns.countplot(data=anomalies, x="category", hue="model")
plt.title("Anomalies Detected by Different Models")
plt.xlabel("Category")
plt.ylabel("Anomaly Count")
plt.legend(title="Model", loc="upper right")
plt.show()
Overall, we can draw two main conclusions:
- the malicious traffic is not really characterised by extreme values, so for the most part it is not captured by the outliers;
- connecting to what we pointed out in the visual analysis chapter, our (mostly numeric) features follow atypical distributions that do not allow a meaningful description of outliers.
CONCLUSION¶
Overall, the supervised models proved to be more effective for our analysis, whereas the unsupervised models were not beneficial for our topic. The supervised models achieved remarkable reliability in detecting attacks, excelling both in binary classification and in handling more complex multiclass scenarios.